Hi,
inspired by the discussion started on the OpenNLP list (see
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E
) I would like to take the opportunity to get some feedback here.
We are starting (slowly, and with very limited resources) to work on
integrating the tokenizers from Lucene/Solr into UIMA (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This
would give us a quite powerful framework for tokenization, including
filtering layers, etc., as well as a JFlex-based engine for defining
tokenizers. We're hoping to be able to reproduce the XML configuration
used in Solr that lets you define character filters, tokenizers, token
filters, etc.
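To illustrate, the chain that a Solr <analyzer> element describes can
also be built programmatically; here is a rough, untested sketch
against the Lucene 3.x APIs (the concrete filters are just examples):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Programmatic equivalent of a Solr analyzer chain:
// a tokenizer followed by a stack of token filters.
public class ChainDemo extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
        stream = new LowerCaseFilter(Version.LUCENE_35, stream);
        stream = new StopFilter(Version.LUCENE_35, stream,
                StandardAnalyzer.STOP_WORDS_SET);
        return stream;
    }
}

The wrapper would essentially have to assemble such a chain from the
Solr XML configuration and feed its output into the CAS.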
One issue that also came up on the thread on the OpenNLP list is that
Lucene/Solr tokenizers normally skip punctuation tokens, which are
irrelevant for search but important for many other tasks. From what we
have seen, this could easily be fixed in the corresponding JFlex grammar.
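To make the punctuation issue concrete, this untested snippet prints
the tokens that StandardTokenizer (whose grammar is the JFlex file in
question) produces; the punctuation simply never shows up:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PunctuationDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35,
                new StringReader("Hello, world!"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints "Hello" then "world"; the comma and "!" are dropped
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}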
We would also need to integrate sentence splitting. We are thinking of
implementing a filter that detects abbreviations, emoticons, etc. and
uses the remaining punctuation tokens as sentence boundaries.
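A rough, untested sketch of what we have in mind (all names are
placeholders, the abbreviation list would of course have to be much
more complete, and it assumes the grammar has been changed to emit
punctuation tokens and to keep emoticons as single tokens):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

// Flags sentence-final punctuation tokens, skipping periods that
// directly follow a known abbreviation.
public final class SentenceBoundaryFilter extends TokenFilter {
    public static final int SENTENCE_BOUNDARY_FLAG = 1;
    private static final Set<String> ABBREVIATIONS =
            new HashSet<String>(Arrays.asList("etc", "Dr", "e.g", "i.e"));
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
    private String previous = "";

    public SentenceBoundaryFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        boolean finalPunct = term.equals(".") || term.equals("!")
                || term.equals("?");
        // a period right after an abbreviation is not a boundary;
        // emoticons like ":-)" are whole tokens and never match
        if (finalPunct && !ABBREVIATIONS.contains(previous)) {
            flagsAtt.setFlags(flagsAtt.getFlags() | SENTENCE_BOUNDARY_FLAG);
        }
        previous = term;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        previous = "";
    }
}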
So far we have been looking into the Solr handling of TokenStreams,
and writing those tokens into the UIMA CAS is trivial, but we have not
had time to start coding the wrapper. We are hoping to start working on
this soon and would appreciate any input that may help us in this task.
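For what it's worth, the CAS-writing part we consider trivial could
look roughly like this (untested; Annotation stands in for a
project-specific Token type):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Copies the character offsets of each Lucene token into the CAS.
public class TokenStreamToCas {
    public static void addTokens(TokenStream stream, JCas jcas)
            throws IOException {
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            Annotation token = new Annotation(jcas,
                    offsets.startOffset(), offsets.endOffset());
            token.addToIndexes();
        }
        stream.end();
        stream.close();
    }
}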
Bye,
Jens