Hi,

Speaking of a Mahout GUI, how useful do you think it would be?
I don't have direct experience with WEKA or similar tool GUIs, but do you think that wrapping Mahout into a kind of plugin for them would be easy and straightforward? More specifically, running Mahout jobs involves a lot of managing Hadoop and/or HDFS, I think. This makes me think that a good GUI for Mahout would always require a little bit more than simply loading a data file, running algorithm(s) and viewing/visualizing results. It should include tools for managing, setting up and monitoring Hadoop, HDFS, etc. Do you agree? Or is there already anything like this included in WEKA or Yale?

To me it seems that it would be nice to build a decent Mahout GUI based on the NetBeans platform, because it contains a lot of useful stuff: they have developed a lot of visualization components for mobile and JSP development, integration of existing Swing components is very easy, and the platform itself provides a lot of useful stuff anyway and allows an easy modular architecture.

I know that it will take a long time before the Mahout API settles down (thus adoption of the JDM API would be useful at some point), but do you think that a GUI for Mahout would be useful and that such development is worth the effort as opposed to integration with the WEKA or Yale GUI? What are the main problems/challenges with a Mahout GUI that you can see?

Regards,
Lukas

On Thu, Aug 28, 2008 at 10:11 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Talking about IO formats, I've been looking at the source code to see if
> there was a way to (de)serialize a matrix to a file system but could not
> find anything. I was thinking about implementing a method to load a matrix
> from a sparse format such as the one described at
> http://math.nist.gov/MatrixMarket/formats.html. Would that be of interest?
> Is there already something similar which I haven't spotted?
>
> About Mahout / Weka: I completely agree with what Grant said (especially
> about keeping it lean).
> The same could apply to other data mining frameworks
> such as RapidMiner (ex-Yale). One could easily integrate Mahout into these
> resources as a plugin to benefit from the GUIs and other functionalities if
> needed.
>
> Julien
>
> 2008/8/27 Grant Ingersoll <[EMAIL PROTECTED]>
>
> >
> > On Aug 27, 2008, at 8:33 AM, Richard Tomsett wrote:
> >
> >> There's quite a good description of WEKA and its capabilities on the
> >> course page for a module I took this year:
> >> http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html
> >>
> >> It's more a general suite of data-mining tools rather than a tool to
> >> address a specific task like Taste (plus it's obviously not implemented for
> >> parallel processing which could be problematic for scaling up). From the
> >> link above:
> >>
> >>   * *Advantages*: The obvious advantage of a package like Weka is that
> >>     *a whole range of data preparation, feature selection and data
> >>     mining algorithms are integrated*. This means that only one data
> >>     format is needed, and trying out and comparing different
> >>     approaches becomes really easy. The package also comes with *a
> >>     GUI*, which should make it easier to use.
> >>
> >
> > Yeah, it would be good for Mahout to adopt an approach for either
> > translating from ARFF to our format, or just use ARFF or whatever else Weka
> > does, but I don't want it to preclude us from innovating where we need to
> > innovate.
> >
> >>
> >>   * *Disadvantages*: Probably the most important disadvantage of data
> >>     mining suites like this is that *they do not implement the newest
> >>     techniques*. For example the MLP implemented has a very basic
> >>     training algorithm (backprop with momentum), and the SVM only uses
> >>     polynomial kernels, and does not support numeric estimation. ...
> >>     *A third possible problem is scaling*. For difficult tasks on
> >>     large datasets, the running time can become quite long, and java
> >>     sometimes gives an OutOfMemory error.
> >>     This problem can be reduced
> >>     by using the '-mx/x/' option when calling java, where /x/ is
> >>     memory size (eg '50m'). For large datasets it will always be
> >>     necessary to reduce the size to be able to work within reasonable
> >>     time limits. A fourth problem is that *the GUI does not implement
> >>     all the possible options*. Things that could be very useful, like
> >>     scoring of a test set, are not provided in the GUI, but can be
> >>     called from the command line interface. So sometimes it will be
> >>     necessary to switch between GUI and command line. Finally, *the
> >>     data preparation and visualisation techniques offered might not be
> >>     enough*. Most of them are very useful, but I think in most data
> >>     mining tasks you will need more to get to know the data well and
> >>     to get it in the right format.
> >>
> >
> > From a Mahout view, we are very much aiming at addressing the scaling
> > issue. As for the GUI, I think that will always be a "contrib" for Mahout,
> > if one ever exists. My personal goal for Mahout is to keep it lean and
> > easily usable in a wide variety of applications. Just as Lucene has made
> > search a commodity in many ways, I think Mahout could enable ML to be a
> > commodity in 5 years.
> >
> > Also, a glaring difference between the two is Weka is GPL. I'll leave it
> > to you to read all the discussions on ASL vs. GPL and do not want to start
> > that discussion here, as there is no point.
> >
> > Last, I imagine we will all coexist nicely. Weka will be useful for many
> > tasks, and Mahout will be useful for many tasks and there will certainly be
> > overlap.
> >
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

--
http://blog.lukas-vlcek.com/
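
For reference, Julien's idea of loading a matrix from the Matrix Market sparse format could look roughly like the sketch below. It is only an illustration and not anything that exists in Mahout: the names `MatrixMarketSketch`, `SparseMatrix` and `parse` are made up for this example, the matrix is held in a plain `HashMap` rather than a Mahout matrix class, and only the `matrix coordinate real general` variant of the format is handled (a header line, optional `%` comment lines, a `rows cols nonzeros` size line, then one 1-based `row col value` triple per line).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Mahout class): parses Matrix Market coordinate text. */
class MatrixMarketSketch {

    /** Minimal sparse matrix used only for this sketch: dimensions plus stored cells. */
    static final class SparseMatrix {
        final int rows;
        final int cols;
        private final Map<Long, Double> cells = new HashMap<>();

        SparseMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
        }

        void set(int row, int col, double value) {
            // Encode the 0-based (row, col) pair as a single long key.
            cells.put((long) row * cols + col, value);
        }

        double get(int row, int col) {
            // Cells that were never stored read as zero.
            return cells.getOrDefault((long) row * cols + col, 0.0);
        }
    }

    /** Parses the whole file content; only "matrix coordinate real general" is accepted. */
    static SparseMatrix parse(String text) {
        String[] lines = text.split("\r?\n");
        String[] header = lines[0].trim().toLowerCase(Locale.ROOT).split("\\s+");
        String[] expected = {"%%matrixmarket", "matrix", "coordinate", "real", "general"};
        if (!Arrays.equals(header, expected)) {
            throw new IllegalArgumentException("Unsupported header: " + lines[0]);
        }

        SparseMatrix matrix = null;
        int declared = 0;
        int read = 0;
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\\s+");
            if (parts.length < 3) {
                throw new IllegalArgumentException("Malformed line " + (i + 1) + ": " + line);
            }
            if (matrix == null) {
                // First data line is the size line: rows, columns, number of entries.
                matrix = new SparseMatrix(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
                declared = Integer.parseInt(parts[2]);
            } else {
                // Entry lines use 1-based indices; convert to 0-based.
                int row = Integer.parseInt(parts[0]) - 1;
                int col = Integer.parseInt(parts[1]) - 1;
                matrix.set(row, col, Double.parseDouble(parts[2]));
                read++;
            }
        }
        if (matrix == null) {
            throw new IllegalArgumentException("Missing size line");
        }
        if (read != declared) {
            throw new IllegalArgumentException("Expected " + declared + " entries, found " + read);
        }
        return matrix;
    }
}
```

Taking the content as a `String` keeps the sketch free of checked exceptions; reading from HDFS or a local file, and the other Matrix Market variants (array, pattern, symmetric, complex), are deliberately left out.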
