You can use a MatchAllDocsQuery if you want to fetch all documents.
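Something like this, for example (a rough sketch against the Lucene 3.x
API; the class name is mine and the index path comes in as an argument):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MatchAllDocsQuery;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class DumpAllDocs {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File(args[0]));
      IndexReader reader = IndexReader.open(dir, true); // read-only
      IndexSearcher searcher = new IndexSearcher(reader);
      // MatchAllDocsQuery matches every non-deleted document; request
      // up to maxDoc() hits so nothing is cut off
      TopDocs hits = searcher.search(new MatchAllDocsQuery(),
          Math.max(1, reader.maxDoc()));
      for (ScoreDoc sd : hits.scoreDocs) {
        // load the stored fields of each matching document
        System.out.println(searcher.doc(sd.doc));
      }
      searcher.close();
      reader.close();
    }
  }

If you only need the subset Michael describes below, see the TermQuery
sketch after the quoted thread.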
On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
> Thank you, Frank! I'll definitely have a look at it.
>
> As far as I can see, the problem with using Lucene for clustering
> tasks is that even with queries you get access only to the
> "tip-of-the-iceberg" results, while clustering tasks need to deal
> with the results as a whole.
>
>
> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>
>> Hi Michael,
>>
>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>
>> This is a lucene2seq tool. You can pass in fields and a Lucene query
>> and it generates text sequence files.
>>
>> From there you can use seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> Sorry for brevity, sent from phone
>>
>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>> can be fed arbitrary key-value constraints on Solr schema fields
>>> (and generate only a subset of the Mahout vectors, which seems to
>>> be a regular use case).
>>>
>>> The best (easiest) way I can see is to create an IndexReader
>>> implementation that allows reading the subset.
>>>
>>> The problem is that I don't know the correct way to do this.
>>>
>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>> don't know which methods to override to get a consistent object
>>> representation.
>>>
>>> The driver code includes the following:
>>>
>>> IndexReader reader = IndexReader.open(dir, true);
>>>
>>> Weight weight;
>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>   weight = new TF();
>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>   weight = new TFIDF();
>>> } else {
>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>       + " is not supported");
>>> }
>>>
>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>     maxDFPercent);
>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>
>>> LuceneIterable iterable;
>>>
>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>> } else {
>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>       norm, maxPercentErrorDocs);
>>> }
>>>
>>> It then creates a SequenceFile.Writer and writes the "iterable"
>>> variable to it.
>>>
>>> Do you have any thoughts on the simplest way to inject this code?
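And for the key-value constraints on Solr schema fields that Michael
asks about above: rather than wrapping the IndexReader, one option is
to express each constraint as a required TermQuery clause and pass the
resulting BooleanQuery (to an IndexSearcher, or to a tool like
lucene2seq that accepts a query) in place of the MatchAllDocsQuery.
A sketch, with made-up field names and values:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  // each key-value constraint becomes a MUST clause; "lang" and
  // "category" are hypothetical example fields, not part of the driver
  BooleanQuery constraints = new BooleanQuery();
  constraints.add(new TermQuery(new Term("lang", "en")),
      BooleanClause.Occur.MUST);
  constraints.add(new TermQuery(new Term("category", "news")),
      BooleanClause.Occur.MUST);

Note that TermQuery matches against the analyzed terms in the index, so
the values need to be in their indexed (e.g. lowercased) form.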
