On Sun Eugen Paraschiv <[email protected]> wrote:
> Hi, I'm starting to use Mahout for some text analysis work, and I was
> looking at the multitude of Apache projects that are out there. I
> have a question regarding the relation between Mahout and Apache
> UIMA, another project that seems to be dealing with machine learning
> and data mining.

UIMA is best suited for annotating and analysing unstructured data, e.g. text, but also images or video content. There are two ways UIMA and Mahout might be used together:

1) Mahout operates on vectors that represent the data points, while UIMA is well suited for document analysis and annotation. You can use UIMA for document processing and add a document writer that writes the documents to disk in a format that Mahout can process.

2) UIMA supports adding your own annotators. It should be no problem to use Mahout models and algorithms inside such annotators, e.g. for document classification.

For the first use case, Mahout devs have so far relied on Lucene's document processing capabilities, simply because there are several Lucene devs in our community. However, I have seen several projects using UIMA for document pre-processing instead. So far no glue code exists; a contribution would be more than welcome, though.

Isabel
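
As a rough illustration of the first use case (analyse documents, then write them out as vectors for a downstream learner), here is a minimal sketch in plain Python. It uses neither UIMA nor Mahout: `tokenize` is only a stand-in for whatever a UIMA analysis engine would produce, the hashing step is one possible way to turn tokens into a sparse vector, and the `doc_id<TAB>index:count` line format is made up for this example. It is not a Mahout input format, so real glue code would have to target whatever format the Mahout job actually reads.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    """Stand-in for a UIMA analysis engine: returns lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_term_frequencies(tokens, dimensions=1024):
    """Map tokens to a sparse vector {index: count} via feature hashing."""
    counts = Counter()
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        index = zlib.crc32(token.encode("utf-8")) % dimensions
        counts[index] += 1
    return dict(counts)


def format_vector_line(doc_id, vector):
    """Serialize one document as 'doc_id<TAB>index:count index:count ...'.

    Hypothetical text format, chosen only for readability in this sketch.
    """
    entries = " ".join(f"{i}:{c}" for i, c in sorted(vector.items()))
    return f"{doc_id}\t{entries}"


def write_vectors(documents, path, dimensions=1024):
    """Write one vector line per document; documents maps doc_id -> raw text."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in documents.items():
            vector = hashed_term_frequencies(tokenize(text), dimensions)
            out.write(format_vector_line(doc_id, vector) + "\n")
```

Calling `write_vectors({"doc1": "Mahout operates on vectors"}, "vectors.txt")` would produce one line for `doc1`. In a real pipeline the tokenization step would be replaced by the annotations coming out of UIMA, and the writer by one that emits the vector format the chosen Mahout algorithm expects.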
