Hi,
inspired by the discussion started on the OpenNLP list (see
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E
) I would like to take the opportunity to get some feedback here.
We are starting (slowly, and with very limited resources) to work on
integrating the tokenizers from Lucene/Solr into UIMA (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This
would give us a quite powerful framework for tokenization, including
filtering layers, etc., as well as a JFlex-based engine for defining
tokenizers. We're hoping to be able to reproduce the XML configuration
used in Solr that lets you define character filters, tokenizers, token
filters, etc.
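To illustrate, the chain that a Solr <analyzer> element describes can
also be built programmatically; here is a rough, untested sketch
against the Lucene 3.x APIs (the concrete filters are just examples):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Programmatic equivalent of a Solr analyzer chain:
// a tokenizer followed by a stack of token filters.
public class ChainDemo extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
        stream = new LowerCaseFilter(Version.LUCENE_35, stream);
        stream = new StopFilter(Version.LUCENE_35, stream,
                StandardAnalyzer.STOP_WORDS_SET);
        return stream;
    }
}

The wrapper would essentially have to assemble such a chain from the
Solr XML configuration and feed its output into the CAS.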
One issue that also came up on the thread on the OpenNLP list is that
Lucene/Solr tokenizers normally skip punctuation tokens, which are
irrelevant for search but important for many other tasks. From what we
have seen, this could easily be fixed in the corresponding JFlex grammar.
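To make the punctuation issue concrete, this untested snippet prints
the tokens that StandardTokenizer (whose grammar is the JFlex file in
question) produces; the punctuation simply never shows up:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PunctuationDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35,
                new StringReader("Hello, world!"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints "Hello" then "world"; the comma and "!" are dropped
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}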
We would also need to integrate sentence splitting. We are thinking of
implementing a filter that detects abbreviations, emoticons, etc. and
uses the remaining punctuation tokens as sentence boundaries.
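A rough, untested sketch of what we have in mind (all names are
placeholders, the abbreviation list would of course have to be much
more complete, and it assumes the grammar has been changed to emit
punctuation tokens and to keep emoticons as single tokens):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

// Flags sentence-final punctuation tokens, skipping periods that
// directly follow a known abbreviation.
public final class SentenceBoundaryFilter extends TokenFilter {
    public static final int SENTENCE_BOUNDARY_FLAG = 1;
    private static final Set<String> ABBREVIATIONS =
            new HashSet<String>(Arrays.asList("etc", "Dr", "e.g", "i.e"));
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
    private String previous = "";

    public SentenceBoundaryFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        boolean finalPunct = term.equals(".") || term.equals("!")
                || term.equals("?");
        // a period right after an abbreviation is not a boundary;
        // emoticons like ":-)" are whole tokens and never match
        if (finalPunct && !ABBREVIATIONS.contains(previous)) {
            flagsAtt.setFlags(flagsAtt.getFlags() | SENTENCE_BOUNDARY_FLAG);
        }
        previous = term;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        previous = "";
    }
}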
So far we have been looking into the Solr handling of TokenStreams,
and writing those tokens into the UIMA CAS is trivial, but we have not
had time to start coding the wrapper. We are hoping to start working on
this soon and would appreciate any input that may help us in this task.
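For what it's worth, the CAS-writing part we consider trivial could
look roughly like this (untested; Annotation stands in for a
project-specific Token type):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Copies the character offsets of each Lucene token into the CAS.
public class TokenStreamToCas {
    public static void addTokens(TokenStream stream, JCas jcas)
            throws IOException {
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            Annotation token = new Annotation(jcas,
                    offsets.startOffset(), offsets.endOffset());
            token.addToIndexes();
        }
        stream.end();
        stream.close();
    }
}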
Bye,
Jens