This sounds like a Lucene query. There are a lot of Lucene coding resources, including the two editions of the book Lucene in Action.
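To illustrate the point: the kind of key-value constraints on schema fields discussed below reduce to an ordinary BooleanQuery of TermQuery clauses. A minimal sketch (the field names "category" and "lang" are made up for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ConstraintQueryExample {

  // Turns key-value constraints into a conjunction of exact-match clauses.
  static Query constraintsToQuery(String[][] constraints) {
    BooleanQuery query = new BooleanQuery();
    for (String[] kv : constraints) {
      // Each constraint must match: field kv[0] contains term kv[1].
      query.add(new TermQuery(new Term(kv[0], kv[1])), Occur.MUST);
    }
    return query;
  }

  public static void main(String[] args) {
    Query q = constraintsToQuery(new String[][] {
        {"category", "sports"},
        {"lang", "en"}
    });
    System.out.println(q);  // prints: +category:sports +lang:en
  }
}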
On Thu, Jan 19, 2012 at 2:15 PM, Frank Scholten <[email protected]> wrote:
> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>
> Configuration configuration = ... ;
> IndexDirectory indexDirectory = ... ;
> Path seqPath = ... ;
> String idField = ... ;
> String field = ... ;
> List<String> extraFields = asList( ... );
> Query query = ... ;
>
> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
>     new LuceneIndexToSequenceFilesConfiguration(configuration,
>         indexDirectory.getFile(), seqPath, idField, field);
> lucene2SeqConf.setExtraFields(extraFields);
> lucene2SeqConf.setQuery(query);
>
> lucene2Seq.run(lucene2SeqConf);
>
> The seqPath variable can be passed into seq2sparse.
>
> Cheers,
>
> Frank
>
> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin <[email protected]> wrote:
>> Frank, could you please tell me how to use your lucene2seq tool?
>>
>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>
>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>
>>>> As far as I can see, the problem with using Lucene for clustering tasks
>>>> is that even with queries you get access to only the "tip-of-the-iceberg"
>>>> results, while clustering tasks need to deal with the results as a whole.
>>>>
>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>> Hi Michael,
>>>>>
>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>
>>>>> This is a lucene2seq tool. You can pass in fields and a Lucene query,
>>>>> and it generates text sequence files.
>>>>>
>>>>> From there you can use seq2sparse.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Frank
>>>>>
>>>>> Sorry for brevity, sent from phone
>>>>>
>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it can
>>>>>> be fed arbitrary key-value constraints on Solr schema fields (and
>>>>>> generate Mahout vectors for only a subset of the index, which seems
>>>>>> like a common use case).
>>>>>>
>>>>>> The best (easiest) way I see is to create an IndexReader implementation
>>>>>> that allows reading just that subset.
>>>>>>
>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>
>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>>> don't know which methods to override to get a consistent object
>>>>>> representation.
>>>>>>
>>>>>> The driver code includes the following:
>>>>>>
>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>
>>>>>> Weight weight;
>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>   weight = new TF();
>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>   weight = new TFIDF();
>>>>>> } else {
>>>>>>   throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
>>>>>> }
>>>>>>
>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>
>>>>>> LuceneIterable iterable;
>>>>>>
>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>> } else {
>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, norm,
>>>>>>       maxPercentErrorDocs);
>>>>>> }
>>>>>>
>>>>>> It then creates a SequenceFile.Writer and writes out the "iterable"
>>>>>> variable.
>>>>>>
>>>>>> Do you have any thoughts on how to inject the code in the simplest way?

--
Lance Norskog
[email protected]
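To make Frank's snippet above concrete, here is a filled-in version that exports every document using the MatchAllDocsQuery he suggests. The LuceneIndexToSequenceFiles classes come from the MAHOUT-944 patch and may have been renamed in later revisions; the paths and field names are placeholders, and the IndexDirectory wrapper is replaced with a plain File for simplicity:

import java.io.File;
import java.util.List;
import static java.util.Arrays.asList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

public class ExportIndexToSequenceFiles {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    File indexDir = new File("/path/to/lucene-index");   // placeholder path
    Path seqPath = new Path("/path/to/sequence-files");  // placeholder path
    String idField = "id";    // stored field holding the document id
    String field = "text";    // stored field whose contents become the text
    List<String> extraFields = asList("title");  // optional extra fields
    Query query = new MatchAllDocsQuery();       // fetch all documents

    LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
        new LuceneIndexToSequenceFilesConfiguration(
            configuration, indexDir, seqPath, idField, field);
    lucene2SeqConf.setExtraFields(extraFields);
    lucene2SeqConf.setQuery(query);

    new LuceneIndexToSequenceFiles().run(lucene2SeqConf);
    // seqPath can now be handed to seq2sparse to produce vectors.
  }
}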
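As for Michael's FilterIndexReader question: one possible approach, sketched here and not code from Mahout, is to run the query once, remember the matching doc ids in a bitset, and report every other document as deleted. This should be enough for consumers that walk doc ids and honor isDeleted(), which is how Mahout's LuceneIterable iterates; note that term-level statistics (e.g. the document frequencies computed by CachedTermInfo) would still reflect the whole index. The class name QuerySubsetIndexReader is made up, and the APIs are Lucene 3.x:

import java.io.IOException;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

// Exposes only the documents matching a query; everything else is
// reported as deleted. Consumers that check isDeleted() per doc id will
// therefore skip the non-matching documents.
public class QuerySubsetIndexReader extends FilterIndexReader {

  private final OpenBitSet matching;

  public QuerySubsetIndexReader(IndexReader in, Query query) throws IOException {
    super(in);
    matching = new OpenBitSet(in.maxDoc());
    IndexSearcher searcher = new IndexSearcher(in);
    try {
      // Collect the ids of all matching documents into the bitset.
      searcher.search(query, new Collector() {
        private int docBase;
        @Override public void setScorer(Scorer scorer) {}
        @Override public void collect(int doc) { matching.set(docBase + doc); }
        @Override public void setNextReader(IndexReader reader, int docBase) {
          this.docBase = docBase;
        }
        @Override public boolean acceptsDocsOutOfOrder() { return true; }
      });
    } finally {
      searcher.close();
    }
  }

  @Override
  public boolean isDeleted(int n) {
    // Non-matching docs look deleted; real deletions stay deleted.
    return !matching.get(n) || in.isDeleted(n);
  }

  @Override
  public boolean hasDeletions() {
    return true;
  }

  @Override
  public int numDocs() {
    return (int) matching.cardinality();
  }
}

Injection into the existing driver would then be a one-line change, replacing the reader it opens:

IndexReader reader = new QuerySubsetIndexReader(IndexReader.open(dir, true), query);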
