Hi,

Talking about IO formats, I've been looking at the source code to see if there was a way to (de)serialize a matrix to and from the file system, but could not find anything. I was thinking about implementing a method to load a matrix from a sparse format such as the one described at http://math.nist.gov/MatrixMarket/formats.html. Would that be of interest? Is there already something similar that I haven't spotted?
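To make the idea concrete, here is a rough sketch of what such a loader might look like for the simplest Matrix Market variant (a "matrix coordinate real general" banner, optional '%' comment lines, a "rows cols nonzeros" size line, then one 1-based "row col value" triple per line). The names MatrixMarketReader, SparseMatrix, read and parse are hypothetical and are not existing Mahout classes; the nested-map storage is only a stand-in for whatever matrix type would really be used, and the other variants of the format (array, pattern, symmetric, complex, ...) are deliberately not handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Mahout: reads "matrix coordinate real general" input. */
final class MatrixMarketReader {

    /** Illustrative sparse holder: row -> (column -> value), zero-based indices. */
    static final class SparseMatrix {
        final int rows;
        final int cols;
        final Map<Integer, Map<Integer, Double>> data = new HashMap<>();

        SparseMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
        }

        void set(int row, int col, double value) {
            data.computeIfAbsent(row, r -> new HashMap<>()).put(col, value);
        }

        /** Returns the stored value, or 0.0 for entries that were never set. */
        double get(int row, int col) {
            Map<Integer, Double> r = data.get(row);
            if (r == null) return 0.0;
            Double v = r.get(col);
            return v == null ? 0.0 : v;
        }
    }

    static SparseMatrix read(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);

        // Banner line, e.g. "%%MatrixMarket matrix coordinate real general".
        String banner = reader.readLine();
        if (banner == null) throw new IOException("Empty input");
        String[] b = banner.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (b.length != 5 || !b[0].equals("%%matrixmarket") || !b[1].equals("matrix")
                || !b[2].equals("coordinate") || !b[3].equals("real")
                || !b[4].equals("general")) {
            throw new IOException("Unsupported banner: " + banner);
        }

        SparseMatrix m = null;   // created once the size line has been read
        int expected = 0;
        int seen = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            // Skip comment lines ('%') and blank lines.
            if (line.isEmpty() || line.startsWith("%")) continue;
            String[] t = line.split("\\s+");
            if (t.length != 3) throw new IOException("Malformed line: " + line);
            if (m == null) {
                // Size line: rows cols nonzeros.
                m = new SparseMatrix(Integer.parseInt(t[0]), Integer.parseInt(t[1]));
                expected = Integer.parseInt(t[2]);
            } else {
                // Entry line: indices in the file are 1-based, convert to 0-based.
                int row = Integer.parseInt(t[0]) - 1;
                int col = Integer.parseInt(t[1]) - 1;
                if (row < 0 || row >= m.rows || col < 0 || col >= m.cols) {
                    throw new IOException("Index out of range: " + line);
                }
                m.set(row, col, Double.parseDouble(t[2]));
                seen++;
            }
        }
        if (m == null) throw new IOException("Missing size line");
        if (seen != expected) {
            throw new IOException("Expected " + expected + " entries, found " + seen);
        }
        return m;
    }

    /** Convenience wrapper for in-memory text; wraps IOException as unchecked. */
    static SparseMatrix parse(String text) {
        try {
            return read(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading from a file would just mean passing a java.io.FileReader (or any other Reader) to read(...); writing the matrix back out would be the mirror image, emitting the banner, the size line and one triple per stored entry.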
About Mahout / Weka: I completely agree with what Grant said (especially about keeping it lean). The same could apply to other data mining frameworks such as RapidMiner (ex-Yale). One could easily integrate Mahout into these frameworks as a plugin to benefit from their GUIs and other functionality if needed.

Julien

2008/8/27 Grant Ingersoll <[EMAIL PROTECTED]>

>
> On Aug 27, 2008, at 8:33 AM, Richard Tomsett wrote:
>
>> There's quite a good description of WEKA and its capabilities on the
>> course page for a module I took this year:
>> http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html
>>
>> It's more a general suite of data-mining tools rather than a tool to
>> address a specific task like Taste (plus it's obviously not implemented for
>> parallel processing which could be problematic for scaling up). From the
>> link above:
>>
>> * *Advantages*: The obvious advantage of a package like Weka is that
>>   *a whole range of data preparation, feature selection and data
>>   mining algorithms are integrated*. This means that only one data
>>   format is needed, and trying out and comparing different
>>   approaches becomes really easy. The package also comes with *a
>>   GUI*, which should make it easier to use.
>>
>
> Yeah, it would be good for Mahout to adopt an approach for either
> translating from ARFF to our format, or just use ARFF or whatever else Weka
> does, but I don't want it to preclude us from innovating where we need to
> innovate.
>
>>
>> * *Disadvantages*: Probably the most important disadvantage of data
>>   mining suites like this is that *they do not implement the newest
>>   techniques*. For example the MLP implemented has a very basic
>>   training algorithm (backprop with momentum), and the SVM only uses
>>   polynomial kernels, and does not support numeric estimation. ...
>>   *A third possible problem is scaling*. For difficult tasks on
>>   large datasets, the running time can become quite long, and java
>>   sometimes gives an OutOfMemory error. This problem can be reduced
>>   by using the '-mx/x/' option when calling java, where /x/ is
>>   memory size (eg '50m'). For large datasets it will always be
>>   necessary to reduce the size to be able to work within reasonable
>>   time limits. A fourth problem is that *the GUI does not implement
>>   all the possible options*. Things that could be very useful, like
>>   scoring of a test set, are not provided in the GUI, but can be
>>   called from the command line interface. So sometimes it will be
>>   necessary to switch between GUI and command line. Finally, *the
>>   data preparation and visualisation techniques offered might not be
>>   enough*. Most of them are very useful, but I think in most data
>>   mining tasks you will need more to get to know the data well and
>>   to get it in the right format.
>>
>
> From a Mahout view, we are very much aiming at addressing the scaling
> issue. As for the GUI, I think that will always be a "contrib" for Mahout,
> if one ever exists. My personal goal for Mahout is to keep it lean and
> easily usable in a wide variety of applications. Just as Lucene has made
> search a commodity in many ways, I think Mahout could enable ML to be a
> commodity in 5 years.
>
> Also, a glaring difference between the two is Weka is GPL. I'll leave it
> to you to read all the discussions on ASL vs. GPL and do not want to start
> that discussion here, as there is no point.
>
> Last, I imagine we will all coexist nicely. Weka will be useful for many
> tasks, and Mahout will be useful for many tasks and there will certainly be
> overlap.
>

--
DigitalPebble Ltd
http://www.digitalpebble.com
