Eugen,

There are several very closely related projects here (from the standpoint of Mahout). These include Hadoop (required for scaling several Mahout programs), Lucene (often used to collect documents), Tika (useful in conjunction with Lucene to extract and process text) and, as you note, UIMA.

While all of these projects have something to do with data mining and unstructured text, the dividing line is generally fairly simple: if it has to do with the data itself or the computing platform, it is UIMA, Lucene or Hadoop, while if it has to do with the actual mathematics involved in the data mining, it will be Mahout doing the work.

As Isabel says, there is little explicit glue code available, but integrating software from these projects is not typically very difficult. There is a huge variety of ways to do this, however, so it is hard to anticipate which use cases are really important. If you have a use case, please talk about it.

On Sun, Jun 13, 2010 at 6:20 AM, Eugen Paraschiv <[email protected]> wrote:

> Hi,
>
> I'm starting to use Mahout for some text analysis work, and I was
> looking at the multitude of Apache projects that are out there. I have a
> question regarding the relation between Mahout and Apache UIMA, another
> project that seems to be dealing with machine learning and data mining.
> There may not be any explicit relation, none that I could find anyway, and
> I don't know if Mahout addresses or will ever address the topic of analysis
> and mining of unstructured content, or if it's outside the scope of the
> project. So, is this a direction Mahout will evolve towards in the
> future?
>
> Thanks.
> Eugen.
