2011/9/5 Ted Dunning <[email protected]>

> I think that alternatives to traditional map-reduce are really important
> for ML. If you think about it, it is now common for 30-40 machine clusters
> to have a terabyte of RAM and it is actually slightly unusual to have ML
> datasets that large. As such, BSP-like compute models have a lot to
> recommend them.
>
> Also, there are implementations of BSP or Pregel that allow the computation
> to exceed memory by requiring all nodes to be serializable. Only the highly
> active ones are kept in memory. This allows a graceful transition from
> complete memory-residency to something more like traditional Hadoop-style
> map-reduce.
>
Regarding that, you may have a look at Apache Hama, which is based on the BSP
model but also offers a Pregel-like API: http://incubator.apache.org/hama/

My 2 cents,
Tommaso

> Another interesting model is that of Spark. I can well imagine that much of
> what we do could be replaced with very small Spark programs which could be
> composed much more easily than our current map-reduce command-line stuff
> could be glued together. Lots of codes should experience two orders of
> magnitude speedup from the use of these alternative systems.
>
> The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should
> actually liberate us from the tyranny of a single compute model, but it does
> bring us to the point of having to decide how much of this sort of thing we
> want to depend on from another project or how much we want to take on
> ourselves. For instance, it might well be possible to co-opt the Spark
> community as part of our own ... their purpose was to support machine
> learning and joining forces might help achieve that on a broader scale than
> previously envisioned. GraphLab is an interesting case as well.
>
> On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[email protected]> wrote:
>
> > Hey gang,
> >
> > Has anyone here played much with
> > Giraph <http://incubator.apache.org/giraph/> (currently in the
> > Apache Incubator)? One of my co-workers ran it on our corporate Hadoop
> > cluster this past weekend, and found it did a very fast PageRank
> > computation (far faster than even well-tuned M/R code on the same data),
> > and it worked pretty close to out-of-the-box. Seems like that style of
> > computation (in-memory distributed datasets), as used by Giraph (and the
> > recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and
> > Spark <http://www.spark-project.org/>, and
> > Twister <http://www.iterativemapreduce.org/>, and
> > Vowpal Wabbit <http://hunch.net/~vw/>, and probably a few others) is more
> > and more the way to go for a lot of the things we want to do - scalable
> > machine learning. "RAM is the new Disk, and Disk is the new Tape" after
> > all...
> >
> > Giraph in particular seems nice, in that it runs on top of "old fashioned"
> > Hadoop - it takes up (long-lived) Mapper slots on your regular cluster,
> > spins up a ZK cluster if you don't supply the location of one, and is all
> > in Java (which may be a minus, for some people, I guess, but the
> > alternatives mean having to run some big exec'ed-out C++ code
> > (GraphLab, VW), or run on top of (admittedly awesome) Mesos
> > (Spark [which, while running on the JVM, is also in Scala]), or run its
> > own totally custom inter-server communication and data structures
> > (Twister and many of the others)).
> >
> > Seems we should be not just supportive of this kind of thing, but try and
> > find some common ground and integration points.
> >
> > -jake
>
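For anyone who has not worked with the vertex-centric model Ted, Tommaso, and
Jake are referring to: in a Pregel/BSP system, each superstep has every vertex
fold the messages it received, update its own value, and send new messages
along its out-edges, with the framework providing the barrier between
supersteps. Below is a rough single-process sketch of PageRank written in that
shape (plain Scala, a made-up four-vertex graph, damping factor, and superstep
count); it illustrates the model only, not the actual Giraph or Hama API.

// Toy, single-process illustration of the superstep model: each superstep,
// every vertex folds the messages it received, updates its value, and sends
// new messages along its out-edges. Graph, damping factor, and superstep
// count are made up for illustration; this is not the Giraph or Hama API.
object ToyPregelPageRank {

  // adjacency list: vertex id -> out-neighbours (hypothetical 4-vertex graph)
  val edges: Map[Int, Seq[Int]] = Map(
    1 -> Seq(2, 3),
    2 -> Seq(3),
    3 -> Seq(1),
    4 -> Seq(1, 3)
  )
  val numVertices = edges.size
  val damping = 0.85
  val supersteps = 30

  def main(args: Array[String]): Unit = {
    // vertex value = current rank, initialised uniformly
    var ranks: Map[Int, Double] =
      edges.keys.map(v => v -> 1.0 / numVertices).toMap

    for (_ <- 1 to supersteps) {
      // "send" phase: each vertex sends rank / out-degree to its neighbours
      val messages: Seq[(Int, Double)] = ranks.toSeq.flatMap { case (v, rank) =>
        val out = edges(v)
        out.map(dst => dst -> rank / out.size)
      }
      // "compute" phase: each vertex sums its incoming messages and updates
      // its value; the barrier between supersteps is implicit in this loop
      val incoming: Map[Int, Double] =
        messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      ranks = ranks.map { case (v, _) =>
        v -> ((1 - damping) / numVertices + damping * incoming.getOrElse(v, 0.0))
      }
    }

    ranks.toSeq.sortBy(_._1).foreach { case (v, r) =>
      println("vertex " + v + " rank " + r)
    }
  }
}

In a real BSP engine the send and compute phases run per worker with a
synchronization barrier between supersteps, and, as Ted notes, vertices that
are not currently active can be spilled to disk because the per-vertex state
is serializable.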

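On the "very small Spark programs" point, the appeal for iterative ML is that
the training data is loaded and cached in cluster memory once, and each
subsequent iteration is a single distributed map and reduce driven from an
ordinary Scala loop, rather than a separate map-reduce job with an HDFS round
trip per pass. A hedged sketch of what that looks like with the plain Spark
RDD API follows; the input path, dimensionality, step size, and iteration
count are placeholders, and this is illustrative rather than Mahout code.

// Sketch of an iterative ML job on Spark: read and cache the training set
// once, then run one distributed map + reduce per gradient step from the
// driver loop. Input path, feature count, step size, and iteration count are
// placeholders; only the plain Spark RDD API is used.
import org.apache.spark.{SparkConf, SparkContext}

object TinyLogisticRegression {

  // one labelled example: label in {-1, +1}, dense feature vector
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tiny-lr").setMaster("local[*]"))

    // hypothetical input: "label feat1 feat2 ..." per line
    val points = sc.textFile("hdfs:///path/to/points.txt").map { line =>
      val t = line.trim.split("\\s+").map(_.toDouble)
      Point(t.drop(1), t(0))
    }.cache() // keep the training set resident across iterations

    val numFeatures = 10 // placeholder dimensionality
    var w = Array.fill(numFeatures)(0.0)
    val stepSize = 0.1

    for (_ <- 1 to 20) {
      // one batch-gradient step for logistic loss with labels in {-1, +1}
      val gradient = points.map { p =>
        val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
    }

    println("weights: " + w.mkString(", "))
    sc.stop()
  }
}

Two such stages compose by passing RDDs between them inside one program, which
is the composability Ted contrasts with gluing map-reduce jobs together on the
command line.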