I was discussing porting these Lucene tokenizers/filters into UIMA with some people at LuceneEurocon, and I'd be happy to help out with that.

Tommaso
2011/11/9 Jens Grivolla <[email protected]>

> Hi,
>
> Inspired by the discussion started on the OpenNLP list (see
> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201111.mbox/%3CCAE%3D29DrYQ1YeVdQVF_Qp-6aKnLubGk1r0F-Sk5Ttd8viK0c5SQ%40mail.gmail.com%3E)
> I would like to take the opportunity to get some feedback here.
>
> We are starting (slowly, and with very limited resources) to work on
> integrating the tokenizers from Lucene/Solr into UIMA (see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters). This would
> give us quite a powerful framework for tokenization, including filtering
> layers, etc., as well as a JFlex-based engine for defining tokenizers.
> We are hoping to be able to reproduce the XML configuration used in Solr
> that lets you define character filters, tokenizers, token filters, etc.
>
> One issue that also came up in the thread on the OpenNLP list is that
> Lucene/Solr tokenizers normally skip punctuation tokens, which are
> irrelevant for search but important for many other tasks. From what we
> have seen, this could easily be fixed in the corresponding JFlex grammar.
>
> We would also need to integrate sentence splitting. We are thinking of
> implementing a filter that detects abbreviations, emoticons, etc., and
> uses the remaining punctuation tokens as sentence boundaries.
>
> So far we have been looking into the Solr handling of TokenStreams, and
> writing those tokens into the UIMA CAS is trivial, but we have not had
> time to start coding the wrapper. We are hoping to start working on this
> soon and would appreciate any input that may help us with this task.
>
> Bye,
> Jens
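For illustration, a minimal sketch of the CAS-writing step Jens describes, assuming a Lucene 3.x-style TokenStream API and a hypothetical `Token` annotation type generated from the wrapper's (not yet written) type system:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.uima.jcas.JCas;

    public class TokenStreamToCas {

      /** Adds one annotation per Lucene token to the CAS, keyed by the
       *  token's character offsets. "Token" is a placeholder for whatever
       *  annotation type the wrapper's type system would define. */
      public static void annotate(JCas jcas, Analyzer analyzer) throws Exception {
        String text = jcas.getDocumentText();
        TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          // character offsets from the analyzer map directly onto
          // UIMA annotation begin/end positions
          Token token = new Token(jcas, offsets.startOffset(), offsets.endOffset());
          token.addToIndexes();
        }
        stream.end();
        stream.close();
      }
    }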
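And a rough sketch of the proposed sentence-splitting heuristic, assuming a hypothetical upstream filter has already removed abbreviations, emoticons, etc., so that the remaining punctuation tokens can serve as sentence boundaries:

    import java.util.ArrayList;
    import java.util.List;

    public class PunctuationSentenceSplitter {

      /** Returns (begin, end) character spans for sentences. Each token is
       *  a (begin, end) offset pair into text, with non-boundary punctuation
       *  assumed to be filtered out upstream. */
      public static List<int[]> split(String text, List<int[]> tokens) {
        List<int[]> sentences = new ArrayList<int[]>();
        int start = -1; // begin offset of the current sentence; -1 = none open
        for (int[] tok : tokens) {
          if (start < 0) {
            start = tok[0]; // first token opens a new sentence
          }
          String surface = text.substring(tok[0], tok[1]);
          // any remaining sentence-final punctuation closes the sentence
          if (surface.equals(".") || surface.equals("!") || surface.equals("?")) {
            sentences.add(new int[] { start, tok[1] });
            start = -1;
          }
        }
        if (start >= 0) { // trailing material without final punctuation
          sentences.add(new int[] { start, tokens.get(tokens.size() - 1)[1] });
        }
        return sentences;
      }
    }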
