Thanks for your advice! I know about this functionality, but my problem is that I need to cluster very different "slices" of a potentially huge index (corpora of texts).

So I was hoping there is a fast way to obtain such a "slice" while keeping only one
index (instead of creating a new index each time I need a "slice").

On 01/18/2012 04:57 PM, Frank Scholten wrote:
You can use a MatchAllDocsQuery if you want to fetch all documents.

On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
Thank you, Frank! I'll definitely have a look at it.

As far as I can see, the problem with using Lucene for clustering tasks
is that even with queries you only get access to the "tip-of-the-iceberg"
results, while clustering needs to deal with the result set as a
whole.
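
To make the distinction concrete, here is a minimal sketch in plain Java. It is not Lucene API: the class HitCollectionSketch and both methods are hypothetical, and an array of scores indexed by document id stands in for a query's matches. It contrasts keeping only the n best-scoring hits with recording every matching document, which is what a clustering job would need.

```java
import java.util.BitSet;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names come from Lucene or Mahout.
final class HitCollectionSketch {

  /** Records every document with a positive score (the whole result set). */
  static BitSet collectAll(float[] scores) {
    BitSet hits = new BitSet(scores.length);
    for (int doc = 0; doc < scores.length; doc++) {
      if (scores[doc] > 0f) {
        hits.set(doc);
      }
    }
    return hits;
  }

  /** Keeps at most n documents with the highest positive scores (ranked search). */
  static BitSet collectTop(float[] scores, int n) {
    // Min-heap on score, so the weakest retained hit is evicted first.
    PriorityQueue<Integer> heap =
        new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
    for (int doc = 0; doc < scores.length; doc++) {
      if (scores[doc] > 0f) {
        heap.add(doc);
        if (heap.size() > n) {
          heap.poll();
        }
      }
    }
    BitSet top = new BitSet(scores.length);
    for (int doc : heap) {
      top.set(doc);
    }
    return top;
  }
}
```

With scores {0.5, 0, 0.9, 0.1, 0.7}, collectAll marks four documents while collectTop(scores, 2) keeps only documents 2 and 4.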


On 01/17/2012 09:56 PM, Frank Scholten wrote:
Hi Michael,

Check out https://issues.apache.org/jira/browse/MAHOUT-944

This is a lucene2seq tool. You can pass in fields and a lucene query and
it generates text sequence files.

From there you can use seq2sparse.

Cheers,

Frank

Sorry for brevity, sent from phone

On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:

Hi!

I am trying to extend the "mahout lucene.vector" driver so that it can be
fed arbitrary key-value constraints on Solr schema fields (and generate
Mahout vectors for only a subset of the index, which seems to be a regular
use case).

So the best (easiest) way I see is to create an IndexReader implementation
that would allow reading only that subset.

The problem is that I don't know the correct way to do this.

Maybe subclassing FilterIndexReader would solve the problem, but I
don't know which methods to override to get a consistent object representation.
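
The subset idea can be sketched independently of Lucene. The following is a minimal illustration in plain Java, not a FilterIndexReader implementation: the class SliceIterable is hypothetical, documents are modeled as field-name-to-value maps, and a document is exposed only if every key-value constraint matches exactly. It assumes the source contains no null documents and that constraint values are non-null.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical illustration only; this is not part of Lucene or Mahout.
final class SliceIterable implements Iterable<Map<String, String>> {
  private final Iterable<Map<String, String>> docs;
  private final Map<String, String> constraints;

  SliceIterable(Iterable<Map<String, String>> docs, Map<String, String> constraints) {
    this.docs = docs;
    this.constraints = constraints;
  }

  /** True if the document carries every required field with the required value. */
  static boolean matches(Map<String, String> doc, Map<String, String> constraints) {
    for (Map.Entry<String, String> e : constraints.entrySet()) {
      if (!e.getValue().equals(doc.get(e.getKey()))) {
        return false;
      }
    }
    return true;
  }

  @Override
  public Iterator<Map<String, String>> iterator() {
    final Iterator<Map<String, String>> it = docs.iterator();
    return new Iterator<Map<String, String>>() {
      // Look-ahead slot; null means the slice is exhausted.
      private Map<String, String> next = advance();

      private Map<String, String> advance() {
        while (it.hasNext()) {
          Map<String, String> doc = it.next();
          if (matches(doc, constraints)) {
            return doc;
          }
        }
        return null;
      }

      @Override
      public boolean hasNext() {
        return next != null;
      }

      @Override
      public Map<String, String> next() {
        if (next == null) {
          throw new NoSuchElementException();
        }
        Map<String, String> result = next;
        next = advance();
        return result;
      }
    };
  }
}
```

Wrapping the full document source this way leaves the single index untouched, and each "slice" is just a different constraints map passed to the constructor. The trade-off is that every slice still scans all documents.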



The driver code includes the following:



    IndexReader reader = IndexReader.open(dir, true);

    Weight weight;
    if ("tf".equalsIgnoreCase(weightType)) {
      weight = new TF();
    } else if ("tfidf".equalsIgnoreCase(weightType)) {
      weight = new TFIDF();
    } else {
      throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
    }

    TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
    VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);

    LuceneIterable iterable;

    if (norm == LuceneIterable.NO_NORMALIZING) {
      iterable = new LuceneIterable(reader, idField, field, mapper,
          LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
    } else {
      iterable = new LuceneIterable(reader, idField, field, mapper, norm,
          maxPercentErrorDocs);
    }




It then creates a SequenceFile.Writer and writes the contents of the "iterable"
variable to it.


Do you have any thoughts on how to inject such code in the simplest way?


