Adding stemmers would be nice, and it could be a fairly easy path to bringing in new developers since it is pretty much independent of other components and easy to test.
However, I would also note that it would be great to get real morphological analysis in there. There is a lot of recent interest in the NLP research community in learning morphological analyzers, and perhaps that can eventually make its way into OpenNLP.

Jason

On Thu, Aug 18, 2011 at 5:52 AM, Jörn Kottmann <[email protected]> wrote:

> On 8/18/11 12:38 PM, Olivier Grisel wrote:
>
>> True, but working on a generic API adapter would make it possible to
>> benefit from the huge set of existing tokenizers / analyzers from the
>> Lucene community. Although I am aware that most of the time Lucene
>> analyzers drop the punctuation information, which is mostly useless for
>> Information Retrieval but often critical for NLP.
>>
>
> As far as I know, Lucene is redistributing the Snowball stemmers;
> that could also be an option for us, then we directly have
> stemmers for all languages we currently support.
>
> I do not really see a benefit in adapting Lucene analyzers;
> if someone wants to use a Lucene tokenizer instead of an OpenNLP
> one, he can simply do that and then provide the
> tokenized text to OpenNLP. That is already supported.
>
> Jörn
>

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
