Yes. In a project I worked on we also had to cluster documents from a particular slice of the index, so we added a custom query to this lucene2seq tool.
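To give a rough idea, the slice query can be built with plain Lucene query classes and handed to the lucene2seq configuration from Frank's snippet quoted below. The field names and values here are only placeholders, not our actual schema:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Select one slice of the index, e.g. all documents with a given category.
// "category" / "sports" and "lang" / "en" are placeholder fields and values.
Query sliceQuery = new TermQuery(new Term("category", "sports"));

// Additional constraints can be ANDed in via a BooleanQuery.
BooleanQuery query = new BooleanQuery();
query.add(sliceQuery, BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("lang", "en")), BooleanClause.Occur.MUST);

// Then hand it to the configuration from MAHOUT-944 (see Frank's example below):
// lucene2SeqConf.setQuery(query);

As Frank mentions further down, a MatchAllDocsQuery does the job when you want the whole index rather than a slice.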
On Sat, Jan 21, 2012 at 4:01 AM, Lance Norskog <[email protected]> wrote:
> This sounds like a Lucene query. There are a lot of Lucene coding
> resources, including two revisions of the book Lucene in Action.
>
> On Thu, Jan 19, 2012 at 2:15 PM, Frank Scholten <[email protected]> wrote:
>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>
>> Configuration configuration = ... ;
>> IndexDirectory indexDirectory = ... ;
>> Path seqPath = ... ;
>> String idField = ... ;
>> String field = ... ;
>> List<String> extraFields = asList( ... );
>> Query query = ... ;
>>
>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>>     LuceneIndexToSequenceFilesConfiguration(configuration, indexDirectory.getFile(), seqPath, idField, field);
>> lucene2SeqConf.setExtraFields(extraFields);
>> lucene2SeqConf.setQuery(query);
>>
>> lucene2Seq.run(lucene2SeqConf);
>>
>> The seqPath variable can be passed into seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin <[email protected]> wrote:
>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>
>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>
>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin <[email protected]> wrote:
>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>
>>>>> As far as I can see, the problem with using Lucene for clustering tasks
>>>>> is that even with queries you only get the "tip of the iceberg" of the
>>>>> results, while clustering needs to deal with the results as a whole.
>>>>>
>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>> Hi Michael,
>>>>>>
>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>
>>>>>> This is the lucene2seq tool. You can pass in fields and a Lucene query
>>>>>> and it generates text sequence files.
>>>>>>
>>>>>> From there you can use seq2sparse.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>> Sorry for brevity, sent from phone
>>>>>>
>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it can
>>>>>>> be fed arbitrary key-value constraints on Solr schema fields (and
>>>>>>> generate only a subset of Mahout vectors, which seems to be a common
>>>>>>> use case).
>>>>>>>
>>>>>>> The best (easiest) way I can see is to create an IndexReader
>>>>>>> implementation that allows reading only that subset.
>>>>>>>
>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>
>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>>>> don't know which methods to override to get a consistent object
>>>>>>> representation.
>>>>>>>
>>>>>>> The driver code includes the following:
>>>>>>>
>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>
>>>>>>> Weight weight;
>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TF();
>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TFIDF();
>>>>>>> } else {
>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
>>>>>>> }
>>>>>>>
>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>
>>>>>>> LuceneIterable iterable;
>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>> } else {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, norm, maxPercentErrorDocs);
>>>>>>> }
>>>>>>>
>>>>>>> It then creates a SequenceFile.Writer and writes out the "iterable" variable.
>>>>>>>
>>>>>>> Do you have any thoughts on the simplest way to inject this code?
>
> --
> Lance Norskog
> [email protected]
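P.S. Before passing seqPath into seq2sparse it can be worth dumping a few records to check that the right slice was exported. A minimal sketch, assuming the sequence files contain Text keys (the id field) and Text values (the document field); the path is only a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path seqPath = new Path("/tmp/lucene2seq-output"); // placeholder: wherever lucene2seq wrote to

SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
try {
  Text key = new Text();   // expected to hold the id field
  Text value = new Text(); // expected to hold the document text
  int count = 0;
  while (reader.next(key, value)) {
    String text = value.toString();
    System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
    count++;
  }
  System.out.println(count + " documents in the slice");
} finally {
  reader.close();
}

Note that reader.next(key, value) reuses the Writable instances, so copy the values out if you need to keep them around.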
