On 8/18/11 12:38 PM, Olivier Grisel wrote:
True, but working on a generic API adapter would make it possible to benefit from the huge set of existing tokenizers / analyzers from the Lucene community. Although I am aware that most of the time Lucene analyzers drop punctuation information, which is mostly useless for Information Retrieval but often critical for NLP.
As far as I know, Lucene redistributes the Snowball stemmers; that could also be an option for us, since we would then directly have stemmers for all the languages we currently support. I do not really see a benefit in adapting Lucene analyzers: if someone wants to use a Lucene tokenizer instead of an OpenNLP one, they can simply do that and then provide the tokenized text to OpenNLP. That is already supported.

Jörn
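A minimal sketch of the workflow Jörn describes: tokenize with a Lucene tokenizer, then hand the token array to an OpenNLP component (here a POS tagger). The model file name "en-pos-maxent.bin" is assumed for illustration, and the StandardTokenizer setup shown matches more recent Lucene releases; older versions take a Reader (and a Version) in the constructor instead.

```java
import java.io.FileInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class LuceneTokensToOpenNLP {

    public static void main(String[] args) throws Exception {
        String text = "OpenNLP and Lucene can work together on the same text.";

        // Tokenize with a Lucene tokenizer (constructor/setup details vary by Lucene version).
        List<String> tokens = new ArrayList<String>();
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            tokens.add(term.toString());
        }
        tokenizer.end();
        tokenizer.close();

        // Feed the pre-tokenized text to an OpenNLP component, e.g. the POS tagger.
        // "en-pos-maxent.bin" is an assumed model file path.
        POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));
        POSTaggerME tagger = new POSTaggerME(model);
        String[] tags = tagger.tag(tokens.toArray(new String[0]));

        for (int i = 0; i < tags.length; i++) {
            System.out.println(tokens.get(i) + "/" + tags[i]);
        }
    }
}
```

Note that a Lucene analyzer chain may have already lowercased, stemmed, or dropped punctuation tokens by this point, which is exactly the caveat Olivier raises above.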
