You can use a MatchAllDocsQuery if you want to fetch all documents.
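Something like this, for example (a rough sketch against the Lucene 3.x
API; the class name is mine and the index path comes in as an argument):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MatchAllDocsQuery;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class DumpAllDocs {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File(args[0]));
      IndexReader reader = IndexReader.open(dir, true); // read-only
      IndexSearcher searcher = new IndexSearcher(reader);
      // MatchAllDocsQuery matches every non-deleted document; request
      // up to maxDoc() hits so nothing is cut off
      TopDocs hits = searcher.search(new MatchAllDocsQuery(),
          Math.max(1, reader.maxDoc()));
      for (ScoreDoc sd : hits.scoreDocs) {
        // load the stored fields of each matching document
        System.out.println(searcher.doc(sd.doc));
      }
      searcher.close();
      reader.close();
    }
  }

If you only need the subset Michael describes below, see the TermQuery
sketch after the quoted thread.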
On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
> Thank you, Frank! I'll definitely have a look at it.
>
> As far as I can see, the problem with using Lucene for clustering
> tasks is that even with queries you get access only to the
> "tip-of-the-iceberg" results, while clustering tasks need to deal
> with the results as a whole.
>
>
> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>
>> Hi Michael,
>>
>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>
>> This is a lucene2seq tool. You can pass in fields and a Lucene query
>> and it generates text sequence files.
>>
>> From there you can use seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> Sorry for brevity, sent from phone
>>
>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>> can be fed arbitrary key-value constraints on Solr schema fields
>>> (and generate only a subset of the Mahout vectors, which seems to
>>> be a regular use case).
>>>
>>> The best (easiest) way I can see is to create an IndexReader
>>> implementation that allows reading the subset.
>>>
>>> The problem is that I don't know the correct way to do this.
>>>
>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>> don't know which methods to override to get a consistent object
>>> representation.
>>>
>>> The driver code includes the following:
>>>
>>> IndexReader reader = IndexReader.open(dir, true);
>>>
>>> Weight weight;
>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>   weight = new TF();
>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>   weight = new TFIDF();
>>> } else {
>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>       + " is not supported");
>>> }
>>>
>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>     maxDFPercent);
>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>
>>> LuceneIterable iterable;
>>>
>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>> } else {
>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>       norm, maxPercentErrorDocs);
>>> }
>>>
>>> It then creates a SequenceFile.Writer and writes the "iterable"
>>> variable to it.
>>>
>>> Do you have any thoughts on the simplest way to inject this code?
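And for the key-value constraints on Solr schema fields that Michael
asks about above: rather than wrapping the IndexReader, one option is
to express each constraint as a required TermQuery clause and pass the
resulting BooleanQuery (to an IndexSearcher, or to a tool like
lucene2seq that accepts a query) in place of the MatchAllDocsQuery.
A sketch, with made-up field names and values:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  // each key-value constraint becomes a MUST clause; "lang" and
  // "category" are hypothetical example fields, not part of the driver
  BooleanQuery constraints = new BooleanQuery();
  constraints.add(new TermQuery(new Term("lang", "en")),
      BooleanClause.Occur.MUST);
  constraints.add(new TermQuery(new Term("category", "news")),
      BooleanClause.Occur.MUST);

Note that TermQuery matches against the analyzed terms in the index, so
the values need to be in their indexed (e.g. lowercased) form.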
