Yes. In a project I worked on we also had to cluster documents from a particular slice of the index, so we added a custom query to this lucene2seq tool.
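To give a rough idea, the slice query can be built with plain Lucene query classes and handed to the lucene2seq configuration from Frank's snippet quoted below. The field names and values here are only placeholders, not our actual schema:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Select one slice of the index, e.g. all documents with a given category.
// "category" / "sports" and "lang" / "en" are placeholder fields and values.
Query sliceQuery = new TermQuery(new Term("category", "sports"));

// Additional constraints can be ANDed in via a BooleanQuery.
BooleanQuery query = new BooleanQuery();
query.add(sliceQuery, BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("lang", "en")), BooleanClause.Occur.MUST);

// Then hand it to the configuration from MAHOUT-944 (see Frank's example below):
// lucene2SeqConf.setQuery(query);

As Frank mentions further down, a MatchAllDocsQuery does the job when you want the whole index rather than a slice.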
On Sat, Jan 21, 2012 at 4:01 AM, Lance Norskog <[email protected]> wrote:
> This sounds like a Lucene query. There are a lot of Lucene coding
> resources, including two revisions of the book Lucene in Action.
>
> On Thu, Jan 19, 2012 at 2:15 PM, Frank Scholten <[email protected]> wrote:
>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>
>> Configuration configuration = ... ;
>> IndexDirectory indexDirectory = ... ;
>> Path seqPath = ... ;
>> String idField = ... ;
>> String field = ... ;
>> List<String> extraFields = asList( ... );
>> Query query = ... ;
>>
>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>>     LuceneIndexToSequenceFilesConfiguration(configuration, indexDirectory.getFile(), seqPath, idField, field);
>> lucene2SeqConf.setExtraFields(extraFields);
>> lucene2SeqConf.setQuery(query);
>>
>> lucene2Seq.run(lucene2SeqConf);
>>
>> The seqPath variable can be passed into seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin <[email protected]> wrote:
>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>
>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>
>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>
>>>>> As far as I can see, the problem with using Lucene for clustering tasks
>>>>> is that even with queries you only get the "tip of the iceberg" of the
>>>>> results, while clustering needs to deal with the results as a whole.
>>>>>
>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>> Hi Michael,
>>>>>>
>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>
>>>>>> This is the lucene2seq tool. You can pass in fields and a Lucene query
>>>>>> and it generates text sequence files.
>>>>>>
>>>>>> From there you can use seq2sparse.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>> Sorry for brevity, sent from phone
>>>>>>
>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it can
>>>>>>> be fed arbitrary key-value constraints on Solr schema fields (and
>>>>>>> generate only a subset of Mahout vectors, which seems to be a common
>>>>>>> use case).
>>>>>>>
>>>>>>> The best (easiest) way I can see is to create an IndexReader
>>>>>>> implementation that allows reading only that subset.
>>>>>>>
>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>
>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>>>> don't know which methods to override to get a consistent object
>>>>>>> representation.
>>>>>>>
>>>>>>> The driver code includes the following:
>>>>>>>
>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>
>>>>>>> Weight weight;
>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TF();
>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TFIDF();
>>>>>>> } else {
>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
>>>>>>> }
>>>>>>>
>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>
>>>>>>> LuceneIterable iterable;
>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>> } else {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, norm, maxPercentErrorDocs);
>>>>>>> }
>>>>>>>
>>>>>>> It then creates a SequenceFile.Writer and writes out the "iterable" variable.
>>>>>>>
>>>>>>> Do you have any thoughts on the simplest way to inject this code?
>
> --
> Lance Norskog
> [email protected]
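P.S. Before passing seqPath into seq2sparse it can be worth dumping a few records to check that the right slice was exported. A minimal sketch, assuming the sequence files contain Text keys (the id field) and Text values (the document field); the path is only a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path seqPath = new Path("/tmp/lucene2seq-output"); // placeholder: wherever lucene2seq wrote to

SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
try {
  Text key = new Text();   // expected to hold the id field
  Text value = new Text(); // expected to hold the document text
  int count = 0;
  while (reader.next(key, value)) {
    String text = value.toString();
    System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
    count++;
  }
  System.out.println(count + " documents in the slice");
} finally {
  reader.close();
}

Note that reader.next(key, value) reuses the Writable instances, so copy the values out if you need to keep them around.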
