On 11/09/2011 12:41 PM, Jörn Kottmann wrote:
On 11/9/11 10:57 AM, Jens Grivolla wrote:
Hi,
inspired by the discussion started on the OpenNLP list (see
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E
) I would like to take the opportunity to get some feedback here.
OpenNLP will not really work if you filter out certain tokens. I can
understand that this behavior makes sense for Lucene, but for most text
analysis usually done by UIMA components (POS tagging, NER, parsing,
etc.) it does not.
We are starting (slowly, and with very limited resources) to work on
integrating the tokenizers from Lucene/Solr into UIMA (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This
would give us quite a powerful framework for tokenization, including
filtering layers, etc., as well as a JFlex-based engine for defining
tokenizers. We're hoping to reproduce the XML configuration used in
Solr that lets you define character filters, tokenizers, token
filters, etc.
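For illustration, a typical Solr analyzer definition of this kind looks
roughly like the following (the field type name and the choice of
filters are just an example, not something we have settled on):

    <fieldType name="text_ugc" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The idea would be to drive a UIMA tokenization component from a
configuration like this instead of a Solr field type.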
Do you want to take an input text and produce a second "filtered" sofa,
or would these be added as annotations to the CAS?
The UIMA concept is different from the one we have in Lucene. The idea
in UIMA is to enhance a sofa with more and more analysis data step by
step, while in Lucene the input data is transformed until it fits
their needs.
The idea is to add Token (and Sentence) annotations to the CAS. The
(transformed) TokenStream in Lucene keeps a reference to the original
character offsets, so the tokens can easily be added as annotations.
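A minimal sketch of that mapping (the Token type and the surrounding
analysis engine are assumptions, the Lucene side is the standard
attribute API; exception handling omitted):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.uima.jcas.JCas;

    // inside the analysis engine's process() method:
    TokenStream ts = analyzer.tokenStream("text",
        new StringReader(jcas.getDocumentText()));
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // offsets point back into the original document text, even
      // after token filters have transformed the term itself
      Token token = new Token(jcas, offsets.startOffset(),
          offsets.endOffset());
      token.addToIndexes();
    }
    ts.end();
    ts.close();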
We have been experimenting with a "ValuedToken" type that allows us to
have token values that don't have to correspond to the coveredText of
the token, so that the transformations done by Lucene are reflected
there. As a proof of concept, we have adapted the OpenNLP POS tagger so
it can optionally take a TokenFeature instead of the coveredText.
We are interested in this in order to handle clitics and contractions
nicely ("can't" could be two tokens with values "can" and "not").
We would also like to reflect the token type from Lucene in the Token
annotation in UIMA, so that we directly have the information that a
token was detected using e.g. the URL rule or the emoticon rule.
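In the Lucene loop sketched above that would just mean also reading the
TypeAttribute and copying it onto the annotation (the tokenType feature
name is made up here):

    // additionally: org.apache.lucene.analysis.tokenattributes.TypeAttribute
    TypeAttribute type = ts.addAttribute(TypeAttribute.class);
    // and inside the incrementToken() loop:
    token.setTokenType(type.type());  // e.g. "<URL>"; the exact type
                                      // strings depend on the grammar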
One issue that also came up on the thread on the OpenNLP list is that
Lucene/Solr tokenizers normally skip punctuation tokens, which are
irrelevant for search but important for many other tasks. From what we
have seen, this could easily be fixed in the corresponding JFlex grammar.
We would also need to integrate sentence splitting. We are thinking of
implementing a filter that detects abbreviations, emoticons, etc., and
of using the remaining punctuation tokens as sentence boundaries.
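A rough sketch of that last step, assuming the filters have already
re-typed abbreviations and emoticons (the Token and Sentence types and
the tokenType feature are again assumptions):

    // iterate over the Token annotations in document order
    int sentenceStart = 0;
    FSIterator<Annotation> it =
        jcas.getAnnotationIndex(Token.type).iterator();
    while (it.hasNext()) {
      Token tok = (Token) it.next();
      boolean isBoundary = tok.getCoveredText().matches("[.!?]+")
          && !"<ABBREVIATION>".equals(tok.getTokenType())
          && !"<EMOTICON>".equals(tok.getTokenType());
      if (isBoundary) {
        new Sentence(jcas, sentenceStart, tok.getEnd()).addToIndexes();
        sentenceStart = tok.getEnd();
      }
    }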
The OpenNLP sentence splitter works great and is already integrated into
UIMA. Would using it be an option for you?
I am using it to process news articles in various languages together
with UIMA-AS.
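For reference, underneath the UIMA wrapper the sentence detector call
boils down to roughly this (the model file name and the Sentence type
are just examples):

    import java.io.FileInputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.util.Span;

    SentenceModel model =
        new SentenceModel(new FileInputStream("en-sent.bin"));
    SentenceDetectorME detector = new SentenceDetectorME(model);
    for (Span s : detector.sentPosDetect(jcas.getDocumentText())) {
      new Sentence(jcas, s.getStart(), s.getEnd()).addToIndexes();
    }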
That's what we're using right now, migrating from the Julielab wrappers
to the official OpenNLP integration. However, when dealing with
user-generated content (UGC), it would sometimes be helpful to have a
rule-based tokenizer and sentence splitter that we can adapt to
specific needs (emoticons, hashtags, URLs, ...).
With the statistical tokenizer and sentence splitter we are having
significant problems with sentence breaks that have no punctuation
mark, and we sometimes need to be able to specify explicitly whether
line breaks are sentence boundaries or not, depending on the input
data. As far as we can see, there is no easy-to-use rule-based
tokenizer and sentence splitter currently available for UIMA.
Jens