2011/8/18 Jörn Kottmann <[email protected]>:
> On 8/18/11 12:24 PM, Olivier Grisel wrote:
>>
>> Is this better or cover more languages than what's already provided by
>> Apache Lucene? Maybe it should better be contributed to the Lucene
>> project and make it easy to use the generic, battle tested Lucene
>> analyzers / tokenizers infrastructure to generate features in OpenNLP.
>
> The OpenNLP APIs are all not designed to work on token streams, instead
> a user usually has to provide an entire sentence at once, so that does not
> make a nice fit.

One could treat each sentence as an individual token stream to make a
generic Lucene adapter.

> And since we are an NLP library I believe it is absolutely fine to implement
> our own stemming here.

True, but working on a generic API adapter would make it possible to
benefit from the huge set of existing tokenizers / analyzers from the
Lucene community. I am aware, though, that most of the time Lucene
analyzers drop punctuation information, which is mostly useless for
Information Retrieval but often critical for NLP.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
