I think that alternatives to traditional map-reduce are really important for ML. If you think about it, it is now common for a 30-40 machine cluster to have a terabyte of aggregate RAM, and it is actually slightly unusual for an ML dataset to be that large. As such, BSP-like compute models, which keep the working data memory-resident across iterations, have a lot to recommend them.
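To make the BSP/Pregel idea concrete, here is a tiny single-process sketch of a vertex-centric PageRank with explicit supersteps and message passing. Everything in it (the object name, the three-vertex toy graph, the fixed 20 supersteps) is purely illustrative and is not Giraph's or any other framework's actual API:

object BspPageRankSketch {
  // Toy graph as an adjacency list: vertex id -> outgoing neighbors.
  val graph: Map[Int, Seq[Int]] = Map(
    1 -> Seq(2, 3),
    2 -> Seq(3),
    3 -> Seq(1))

  def main(args: Array[String]): Unit = {
    val n = graph.size
    // Every vertex starts with rank 1/n.
    var rank: Map[Int, Double] = graph.keys.map(v => v -> 1.0 / n).toMap

    // One superstep: every vertex sends rank/outDegree along its out-edges,
    // then (after a barrier) folds its incoming messages into a new rank.
    for (superstep <- 1 to 20) {
      val messages: Seq[(Int, Double)] = graph.toSeq.flatMap {
        case (v, outs) => outs.map(dst => (dst, rank(v) / outs.size))
      }
      val incoming: Map[Int, Double] =
        messages.groupBy(_._1).mapValues(_.map(_._2).sum).toMap
      rank = graph.keys.map(v =>
        v -> (0.15 / n + 0.85 * incoming.getOrElse(v, 0.0))).toMap
    }

    rank.toSeq.sortBy(_._1).foreach { case (v, r) => println(v + " -> " + r) }
  }
}

The point is just the shape of the computation: in each superstep every vertex sends messages along its edges, there is a global barrier, and then every vertex folds its incoming messages into new state. In a real framework the vertices are partitioned across machines and the barrier is distributed, but the per-vertex logic looks much like this.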
Also, there are implementations of BSP or Pregel that allow the computation to exceed memory by requiring all nodes to be serializable; only the highly active ones are kept in memory. This allows a graceful transition from complete memory-residency to something more like traditional Hadoop-style map-reduce.

Another interesting model is Spark's. I can well imagine that much of what we do could be replaced with very small Spark programs which could be composed much more easily than our current map-reduce command-line stuff can be glued together (a toy sketch of what I mean is at the bottom of this message). Lots of our code should see two orders of magnitude of speedup from the use of these alternative systems.

The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should actually liberate us from the tyranny of a single compute model, but it does bring us to the point of having to decide how much of this sort of thing we want to depend on from another project and how much we want to take on ourselves. For instance, it might well be possible to co-opt the Spark community as part of our own; their purpose was to support machine learning, and joining forces might help achieve that on a broader scale than previously envisioned. GraphLab is an interesting case as well.

On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[email protected]> wrote:

> Hey gang,
>
> Has anyone here played much with Giraph <http://incubator.apache.org/giraph/> (currently in the Apache Incubator)? One of my co-workers ran it on our corporate Hadoop cluster this past weekend, and found it did a very fast PageRank computation (far faster than even well-tuned M/R code on the same data), and it worked pretty close to out of the box. Seems like that style of computation (in-memory distributed datasets), as used by Giraph (and the recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and Spark <http://www.spark-project.org/>, and Twister <http://www.iterativemapreduce.org/>, and Vowpal Wabbit <http://hunch.net/~vw/>, and probably a few others) is more and more the way to go for a lot of the things we want to do - scalable machine learning. "RAM is the new Disk, and Disk is the new Tape" after all...
>
> Giraph in particular seems nice, in that it runs on top of "old fashioned" Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, spins up a ZK cluster if you don't supply the location of one, and is all in Java (which may be a minus for some people, I guess, but the alternatives mean running some big exec'ed-out C++ code (GraphLab, VW), running on top of (admittedly awesome) Mesos (Spark [which, while running on the JVM, is also in Scala]), or running their own totally custom inter-server communication and data structures (Twister and many of the others)).
>
> Seems we should be not just supportive of this kind of thing, but try and find some common ground and integration points.
>
> -jake
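P.S. Since I mentioned small composable Spark programs above, here is the kind of thing I have in mind. It is only a rough sketch: the input path and the two toy computations are invented for illustration, and the spark.* import path is just my assumption about the current packaging; the RDD operations themselves (textFile, flatMap, filter, cache, reduceByKey, count) are the basic Spark API.

import spark.SparkContext
import spark.SparkContext._   // brings in the pair-RDD operations such as reduceByKey

object ComposableJobsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "ComposableJobsSketch")

    // Tokenize the corpus once and keep the result in cluster memory.
    // The input path is hypothetical.
    val tokens = sc.textFile("hdfs:///some/corpus")
      .flatMap(line => line.toLowerCase.split("\\s+"))
      .filter(w => w.nonEmpty)
      .cache()

    // Two downstream computations compose directly against the cached RDD,
    // with no intermediate HDFS files glued together on the command line.
    val termCounts = tokens.map(w => (w, 1)).reduceByKey(_ + _)
    val vocabularySize = termCounts.count()
    val singletonTerms = termCounts.filter(_._2 == 1).count()

    println("vocabulary = " + vocabularySize + ", singletons = " + singletonTerms)
  }
}

The point is that the tokenized dataset is built once, cached, and then reused by both computations within one small program, rather than each stage writing to and re-reading from HDFS the way our current pipelines do.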
