On Mon, Sep 5, 2011 at 2:13 AM, Sean Owen <[email protected]> wrote:

> My high-level view is that Hadoop was very excellent for its intended use
> case, and that because of this, people have abused it to do things quite
> unlike what it was designed for. It's amazing that a glorified logs
> processing framework could do anything like machine learning well. Mahout
> embodies that interesting struggle.
>
Yeah, and the more I try to do straightforward things on raw M/R, the more often I get warning looks from my cluster admins who ask me if I really need to run iterative algorithms with 10 TB of intermediate data (between each step...).

> I can only believe that most any of the "next gen" frameworks discussed
> here, which are necessarily more general-purpose, will be better for things
> like machine learning.

More general purpose? Or less?

> I am not so interested in MR 2.0 -- nothing wrong
> with it just not something better conceptually for machine learning. I like
> projects like Ciel from MS Research -- simply more general purpose graph-
> and data-flow-oriented frameworks.

Yeah, I'll believe the hype on MR 2.0 when I see it (er, the "it" being "something more than the hype").

> I personally believe that while Mahout *could* be anything, that it's
> reached about the level of scope it can possibly sustain given the amount of
> effort coming in, in trying to do something interesting on top of
> MapReduce.
> This will be useful for a couple years to come yet.

Definitely useful, but also reaching a point of limitedness, given the ratio of available RAM to useful data size (as Ted mentioned).

> That is to say: I think it will be interesting to explore another
> machine-learning-at-scale project in 2 years or so on top of one of these
> next-gen frameworks.
> (Was that the question?)

Well, I think it's actually time to start working with some of the more promising ones *now*, before they become their own fully fledged communities and have their own technical debt which is hard to interoperate with. As I see it, things like GraphLab and VW, being off the JVM, require much more work from the other community, and as such the best we can do is help test out the integration layers, for the time being.
NextGen MR is something to think about in the future, as you say. Spark may be closer, but it also requires Mesos, so it is also more "future". Giraph, though, is very new, has a very small community, and is very easily adoptable, I think.

-jake
