On 11/26/10 9:35 AM, Tommaso Teofili wrote:
Hi all,
following Burn's proposal for multimodal analysis component skeleton I also
have a couple of components to propose for inclusion inside the sandbox:
- Solr CAS Consumer - to consume CAS/types/features inside Solr fields.
This could be put inside Lucas or in a separate project
As far as I know, the main difference from a configuration point of
view is that Lucas defines the language analyzers inside the AE
configuration, while Solr defines them in a server-side XML
configuration file.
In the end there might not be much that could be reused from Lucas.
Lucas is not maintained right now, and I guess that is because most
people are not interested in creating a Lucene index from a bunch of
documents.
We use UIMA to process a stream of documents that are received
continuously. In this model a Solr AE fits really nicely, because
it just sends each received document to a Solr server, which adds it
to the index. After a document is received it can be searched with a
short delay. With Lucas that would not be possible.
I actually created a small Solr AE for doing a quick semantic search demo.
One problem I ran into is that the Solr AE really slows down my processing
pipeline. Anyway I would be happy to test your implementation and contribute
to it.
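
To make the "send each document to Solr and search it after a short delay"
idea concrete, here is a minimal sketch of the request such an AE might
build. It is not the code of my demo AE or of the proposed consumer. It
assumes a Solr version that accepts JSON on its /update handler and honours
the commitWithin parameter, and the core URL and the field names "id" and
"text" are made up for illustration (they depend on the Solr schema).

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SolrUpdateSketch {

    // Escape the characters that must be escaped inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a one-document JSON update body. "id" and "text" are
    // hypothetical field names; a real schema may use different ones.
    static String toJson(String id, String text) {
        return "[{\"id\":\"" + jsonEscape(id) + "\",\"text\":\""
                + jsonEscape(text) + "\"}]";
    }

    // Build (but do not send) the HTTP request. commitWithin asks Solr to
    // make the document searchable within the given number of milliseconds.
    static HttpRequest buildUpdate(String coreUrl, String id, String text,
                                   int commitWithinMs) {
        URI uri = URI.create(coreUrl + "/update?commitWithin=" + commitWithinMs);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJson(id, text)))
                .build();
    }
}
```

Sending the request would be one more call on a java.net.http.HttpClient.
Doing that once per document is probably part of why my pipeline got slow,
but I have not measured it.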
- a Simple Language Annotator - to extract language from document text,
this one can use 3 algorithms:
- Tika 0.8 language identification capability
- Alchemy language annotator
- Dictionaries of stopwords for each language
We could easily add AEs which set the language to the Tika and
Alchemy projects we already have. It can also be done with OpenNLP.
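
For the third option (stopword dictionaries), the idea can be sketched
roughly as below. This is only an illustration, not the proposed
annotator: the class name, the three tiny word lists and the "unknown"
fallback are all assumptions, and a real component would load much larger
dictionaries.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StopwordLanguageGuesser {

    // Tiny illustrative stopword lists keyed by language code (hypothetical data).
    static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "is", "of", "to", "in", "that"));
        STOPWORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "ein"));
        STOPWORDS.put("it", Set.of("il", "la", "che", "di", "e", "un", "per"));
    }

    // Return the language whose stopwords occur most often in the text,
    // or "unknown" if no stopword from any list is found.
    static String guess(String text) {
        // Split on anything that is not a letter, after lowercasing.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Ties keep the earlier language in insertion order.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Inside an AE the result of guess(...) could then be passed to
setDocumentLanguage on the CAS, so that later annotators can pick it up.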
Jörn