2011/8/18 Jörn Kottmann <[email protected]>:
> On 8/18/11 12:24 PM, Olivier Grisel wrote:
>>
>> Is this better, or does it cover more languages, than what's already
>> provided by Apache Lucene? Maybe it would be better to contribute it to
>> the Lucene project and make it easy to use the generic, battle-tested
>> Lucene analyzers / tokenizers infrastructure to generate features in
>> OpenNLP.
>
> None of the OpenNLP APIs are designed to work on token streams. Instead,
> a user usually has to provide an entire sentence at once, so it is not
> a nice fit.

One could treat each sentence as an individual token stream to make a
generic Lucene adapter.
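
A rough sketch of that idea in plain Java, with no Lucene or OpenNLP
dependency. Every name here (SentenceAnalyzer, SentenceStreamAdapter,
WhitespaceLowercaseAnalyzer) is hypothetical, and Iterator<String> is only
a stand-in for a real Lucene token stream. The assumption is that the
sentence-level tools want the tokens of one sentence as a String[]:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for an analyzer: one sentence in, one token stream out. */
interface SentenceAnalyzer {
    Iterator<String> tokenStream(String sentence);
}

/** Hypothetical adapter: drains one token stream per sentence into a String[]. */
final class SentenceStreamAdapter {
    private final SentenceAnalyzer analyzer;

    SentenceStreamAdapter(SentenceAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    /** Treats the sentence as its own token stream and collects every token. */
    String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> stream = analyzer.tokenStream(sentence);
        while (stream.hasNext()) {
            tokens.add(stream.next());
        }
        return tokens.toArray(new String[0]);
    }

    /** Convenience wrapper for input that is already split into sentences. */
    List<String[]> tokenizeAll(List<String> sentences) {
        List<String[]> result = new ArrayList<>();
        for (String sentence : sentences) {
            result.add(tokenize(sentence));
        }
        return result;
    }
}

/** Toy analyzer for the demo: lowercases and splits on whitespace only. */
final class WhitespaceLowercaseAnalyzer implements SentenceAnalyzer {
    @Override
    public Iterator<String> tokenStream(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            // Avoid the single empty token that split() would produce.
            return Collections.emptyIterator();
        }
        String[] parts = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        return Arrays.stream(parts).iterator();
    }
}

final class SentenceAdapterDemo {
    public static void main(String[] args) {
        SentenceStreamAdapter adapter =
                new SentenceStreamAdapter(new WhitespaceLowercaseAnalyzer());
        List<String> sentences =
                Arrays.asList("Stemming is useful .", "So are tokenizers !");
        for (String[] tokens : adapter.tokenizeAll(sentences)) {
            System.out.println(Arrays.toString(tokens));
        }
    }
}
```

A real adapter would implement SentenceAnalyzer on top of a Lucene
analyzer instead of the toy class above. The toy analyzer keeps
punctuation only because the demo input is already whitespace-separated.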

> And since we are an NLP library, I believe it is absolutely fine to
> implement our own stemming here.

True, but working on a generic API adapter would make it possible to
benefit from the huge set of existing tokenizers / analyzers from the
Lucene community. I am aware, though, that most of the time Lucene
analyzers drop the punctuation information, which is mostly useless for
Information Retrieval but often critical for NLP.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
