Re: Lucene/Solr tokenization in UIMA

Jörn Kottmann Wed, 09 Nov 2011 03:42:22 -0800

On 11/9/11 10:57 AM, Jens Grivolla wrote:

Hi,
inspired by the discussion started on the OpenNLP list (see http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E), I would like to take the opportunity to get some feedback here.

OpenNLP will not really work if you filter out certain tokens. I can understand that this behavior makes sense for Lucene, but for most text analysis that is usually done by UIMA components (POS tagging, NER, parsing, etc.) it does not.

We are starting (slowly, and with very limited resources) to work on integrating the tokenizers from Lucene/Solr into UIMA (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This would give us a quite powerful framework for tokenization including filtering layers, etc., as well as a JFlex-based engine for defining tokenizers. We're hoping to be able to reproduce the XML configuration used in Solr that lets you define character filters, tokenizers, token filters, etc.

Do you want to take an input text and produce a second "filtered" sofa, or would these be added as annotations to the CAS? The UIMA concept is different from the one we have in Lucene. The idea in UIMA is to enhance a sofa with more and more analysis data step by step, whereas in Lucene they transform the input data until it fits their needs.
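
To make the contrast concrete, here is a minimal sketch in plain Python. It does not use the UIMA or Lucene APIs; the `Annotation` class and both function names are hypothetical and only illustrate the two styles: a Lucene-like step that returns a transformed token stream, and a UIMA-like step that leaves the text untouched and adds offset-based annotations (punctuation included).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a CAS annotation: a type plus offsets."""
    type: str
    begin: int
    end: int


def transform_style(text):
    """Lucene-like: produce a new token stream; dropped material is gone."""
    return [m.group().lower() for m in re.finditer(r"\w+", text)]


def annotate_style(text):
    """UIMA-like: keep the text as is and add offset annotations on top."""
    return [Annotation("Token", m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


if __name__ == "__main__":
    text = "Hello, UIMA!"
    print(transform_style(text))
    for a in annotate_style(text):
        print(a.type, a.begin, a.end, repr(text[a.begin:a.end]))
```

In the annotation style, a later component (a POS tagger, for example) can still see the comma and the exclamation mark, because nothing was removed from the original text.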

One issue that also came up on the thread on the OpenNLP list is that Lucene/Solr tokenizers normally skip punctuation tokens, which are irrelevant for search but important for many other tasks. From what we have seen, this could easily be fixed in the corresponding JFlex grammar.
We would also need to integrate sentence splitting. We are thinking of implementing a filter that detects abbreviations, emoticons, etc. and uses the remaining punctuation tokens as sentence boundaries.
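
The sentence-splitting idea described above could look roughly like the following sketch. It is plain Python, not a Lucene token filter, and everything in it is an assumption made for illustration: the tiny abbreviation and emoticon lists, the whitespace-based chunking, and the function names.

```python
import re

# Hypothetical, deliberately tiny lists; real ones would be language-specific.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr."}
EMOTICONS = {":-)", ":)", ";-)", ":-("}
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Keep known abbreviations and emoticons whole; split everything else
    into word tokens and single punctuation tokens."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS or chunk in EMOTICONS:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def split_sentences(tokens):
    """Treat every remaining '.', '!' or '?' token as a sentence boundary."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(tokenize("Dr. Smith is late. Great :-) Really?")):
        print(sentence)
```

A heuristic like this cannot tell when an abbreviation such as "etc." also ends a sentence, and it misses abbreviations that are directly followed by other punctuation.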

The OpenNLP sentence splitter works great and is already integrated into UIMA. Would using it be an option for you? I am using it to process news articles in various languages together with UIMA-AS.


Jörn
