Hi,

Inspired by the discussion started on the OpenNLP list (see http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E), I would like to take the opportunity to get some feedback here.

We are starting (slowly, and with very limited resources) to work on integrating the tokenizers from Lucene/Solr into UIMA (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This would give us a quite powerful tokenization framework, including filtering layers, etc., as well as a JFlex-based engine for defining tokenizers. We are hoping to reproduce the XML configuration used in Solr, which lets you define character filters, tokenizers, token filters, and so on.
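For those who have not used Solr: such an analyzer chain is declared in schema.xml roughly like this (the concrete factories are just an illustration):

    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <!-- character filters run on the raw input before tokenization -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <!-- the tokenizer produces the initial token stream -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- token filters then transform the stream, one after another -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      </analyzer>
    </fieldType>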

One issue that also came up in the thread on the OpenNLP list is that Lucene/Solr tokenizers normally skip punctuation tokens, which are irrelevant for search but important for many other tasks. From what we have seen, this could easily be fixed in the corresponding JFlex grammar.
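For example, a rule along the following lines could be added ahead of the grammar's catch-all skip rule, so that punctuation is emitted as a token type of its own (the PUNCTUATION constant is purely illustrative, not an existing Lucene token type):

    /* emit sentence-level punctuation instead of silently skipping it */
    [.,;:!?]    { return PUNCTUATION; }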

We would also need to integrate sentence splitting. We are thinking of implementing a filter that detects abbreviations, emoticons, etc., and uses the remaining punctuation tokens as sentence boundaries.
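A minimal sketch of what such a filter could look like, written against Lucene's TokenFilter API (the SENTENCE_BOUNDARY type and the tiny hard-coded abbreviation list are placeholders for the real detection logic, and it assumes punctuation tokens are present in the stream as discussed above):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public final class SentenceBoundaryFilter extends TokenFilter {
      public static final String SENTENCE_BOUNDARY = "<SENTENCE_BOUNDARY>";
      // Placeholder; real abbreviation detection would be configurable.
      private static final Set<String> ABBREVIATIONS =
          new HashSet<String>(Arrays.asList("Dr", "Prof", "etc", "vs"));

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
      private String previousTerm = "";

      public SentenceBoundaryFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        // Re-type sentence-final punctuation, unless the preceding
        // token looks like an abbreviation (deliberately naive).
        if (term.matches("[.!?]") && !ABBREVIATIONS.contains(previousTerm)) {
          typeAtt.setType(SENTENCE_BOUNDARY);
        }
        previousTerm = term;
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        previousTerm = "";
      }
    }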

So far we have been looking into Solr's handling of TokenStreams, and writing those tokens into the UIMA CAS is trivial, but we have not yet had time to start coding the wrapper. We are hoping to start working on this soon and would appreciate any input that may help us with this task.
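To make the "trivial" part concrete, the core of the wrapper would boil down to something like the following (a sketch only; it uses the generic UIMA Annotation type, whereas the real wrapper would define its own Token type in a type system descriptor):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    public class TokenStreamToCas {
      public static void process(TokenStream stream, JCas jcas) throws IOException {
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          // Lucene token offsets map directly onto UIMA annotation spans.
          Annotation token =
              new Annotation(jcas, offsets.startOffset(), offsets.endOffset());
          token.addToIndexes();
        }
        stream.end();
        stream.close();
      }
    }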

Bye,
Jens
