I was discussing porting these Lucene tokenizers/filters into UIMA with some people at LuceneEurocon, and I'd be happy to help out with that.

Tommaso
2011/11/9 Jens Grivolla <[email protected]>

> Hi,
>
> Inspired by the discussion started on the OpenNLP list (see
> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E)
> I would like to take the opportunity to get some feedback here.
>
> We are starting (slowly, and with very limited resources) to work on
> integrating the tokenizers from Lucene/Solr into UIMA (see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This would
> give us quite a powerful framework for tokenization, including filtering
> layers, etc., as well as a JFlex-based engine for defining tokenizers.
> We are hoping to be able to reproduce the XML configuration used in Solr
> that lets you define character filters, tokenizers, token filters, etc.
>
> One issue that also came up in the thread on the OpenNLP list is that
> Lucene/Solr tokenizers normally skip punctuation tokens, which are
> irrelevant for search but important for many other tasks. From what we
> have seen, this could easily be fixed in the corresponding JFlex grammar.
>
> We would also need to integrate sentence splitting. We are thinking of
> implementing a filter that detects abbreviations, emoticons, etc., and
> uses the remaining punctuation tokens as sentence boundaries.
>
> So far we have been looking into the Solr handling of TokenStreams, and
> writing those tokens into the UIMA CAS is trivial, but we have not had
> time to start coding the wrapper. We are hoping to start working on this
> soon and would appreciate any input that may help us with this task.
>
> Bye,
> Jens
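For illustration, a minimal sketch of the CAS-writing step Jens describes, assuming a Lucene 3.x-style TokenStream API and a hypothetical `Token` annotation type generated from the wrapper's (not yet written) type system:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.uima.jcas.JCas;

    public class TokenStreamToCas {

      /** Adds one annotation per Lucene token to the CAS, keyed by the
       *  token's character offsets. "Token" is a placeholder for whatever
       *  annotation type the wrapper's type system would define. */
      public static void annotate(JCas jcas, Analyzer analyzer) throws Exception {
        String text = jcas.getDocumentText();
        TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          // character offsets from the analyzer map directly onto
          // UIMA annotation begin/end positions
          Token token = new Token(jcas, offsets.startOffset(), offsets.endOffset());
          token.addToIndexes();
        }
        stream.end();
        stream.close();
      }
    }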
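And a rough sketch of the proposed sentence-splitting heuristic, assuming a hypothetical upstream filter has already removed abbreviations, emoticons, etc., so that the remaining punctuation tokens can serve as sentence boundaries:

    import java.util.ArrayList;
    import java.util.List;

    public class PunctuationSentenceSplitter {

      /** Returns (begin, end) character spans for sentences. Each token is
       *  a (begin, end) offset pair into text, with non-boundary punctuation
       *  assumed to be filtered out upstream. */
      public static List<int[]> split(String text, List<int[]> tokens) {
        List<int[]> sentences = new ArrayList<int[]>();
        int start = -1; // begin offset of the current sentence; -1 = none open
        for (int[] tok : tokens) {
          if (start < 0) {
            start = tok[0]; // first token opens a new sentence
          }
          String surface = text.substring(tok[0], tok[1]);
          // any remaining sentence-final punctuation closes the sentence
          if (surface.equals(".") || surface.equals("!") || surface.equals("?")) {
            sentences.add(new int[] { start, tok[1] });
            start = -1;
          }
        }
        if (start >= 0) { // trailing material without final punctuation
          sentences.add(new int[] { start, tokens.get(tokens.size() - 1)[1] });
        }
        return sentences;
      }
    }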
