I think that alternatives to traditional map-reduce are really important for
ML.  If you think about it, it is now common for a cluster of 30-40 machines
to have a terabyte of aggregate RAM (25-35 GB or so per node), and it is
actually somewhat unusual for an ML dataset to be that large.  As such,
BSP-like compute models that keep the working set in memory have a lot to
recommend them.

Also, there are implementations of BSP or Pregel that allow the computation
to exceed memory by requiring all vertices to be serializable; only the most
active ones are kept in memory.  This allows a graceful transition from
complete memory residency to something more like traditional Hadoop-style
map-reduce.
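
For concreteness, here is a hypothetical sketch (in Scala; none of these
names come from Giraph, GraphLab, or any other actual project) of the kind
of contract that makes this possible: every vertex is serializable, and the
compute() callback plus a vote-to-halt flag tell the framework when a vertex
can be evicted from memory until a message arrives for it.

// Pregel-style vertex contract: serializable state, one compute() call per
// superstep.  A framework built on this can spill inactive vertices to disk
// and page them back in only when they receive messages.
trait Vertex[V, M] extends Serializable {
  def id: Long
  def value: V

  // Returns the updated value, an outbox of (destination, message) pairs,
  // and whether this vertex votes to halt.
  def compute(superstep: Int, messages: Iterable[M]): (V, Seq[(Long, M)], Boolean)
}

// Toy PageRank vertex: the value is the current rank, and the messages are
// rank shares arriving from in-neighbors.
class PageRankVertex(val id: Long, var rank: Double, outEdges: Seq[Long])
    extends Vertex[Double, Double] {

  def value: Double = rank

  def compute(superstep: Int, messages: Iterable[Double]) = {
    if (superstep > 0) rank = 0.15 + 0.85 * messages.sum
    val share = if (outEdges.isEmpty) 0.0 else rank / outEdges.size
    val outbox = outEdges.map(dst => (dst, share))
    (rank, outbox, superstep >= 30)   // vote to halt after a fixed number of supersteps
  }
}

The point is just that serializable vertex state plus explicit superstep
boundaries give the framework complete freedom over which vertices stay
resident.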

Another interesting model is that of Spark.  I can well imagine that much of
what we do could be replaced with very small Spark programs that could be
composed much more easily than our current map-reduce command-line tools can
be glued together.  Much of our code should see two orders of magnitude of
speedup from these alternative systems.
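
As a rough sketch of what "very small" means, here is roughly what a
complete iterative job (logistic regression by gradient descent) looks like
in Spark.  The input path, feature format, step size, and iteration count
are all made up for illustration, and exact API details vary by Spark
version; the point is that the whole pipeline is one short program with the
training data cached in cluster memory.

// A complete iterative Spark job in one small file.  The data is read once
// from HDFS, cached in RAM, and then scanned in memory on every iteration
// instead of being re-read by a fresh map-reduce pass.
import org.apache.spark.{SparkConf, SparkContext}

object TinyLogistic {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tiny-logistic"))

    // Each input line: a 0/1 label followed by whitespace-separated features.
    val data = sc.textFile("hdfs:///tmp/train.txt").map { line =>
      val parts = line.split("\\s+").map(_.toDouble)
      (parts.head, parts.tail)
    }.cache()                          // keep the working set memory-resident

    var w = Array.fill(data.first()._2.length)(0.0)

    for (_ <- 1 to 20) {               // each pass is an in-memory scan
      val grad = data.map { case (y, x) =>
        val p = 1.0 / (1.0 + math.exp(-(w, x).zipped.map(_ * _).sum))
        x.map(_ * (p - y))             // per-example gradient of the log loss
      }.reduce((a, b) => (a, b).zipped.map(_ + _))
      w = (w, grad).zipped.map((wi, gi) => wi - 0.1 * gi)
    }

    println(w.mkString(" "))
    sc.stop()
  }
}

Because these are ordinary functions over cached datasets, a handful of such
stages can be composed by plain function calls inside one program, which is
exactly what is hard to do with a chain of map-reduce drivers glued together
on the command line.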

The arrival of map-reduce 2.0 (YARN), or the equivalent use of Mesos, should
actually liberate us from the tyranny of a single compute model, but it does
bring us to the point of having to decide how much of this sort of thing we
want to depend on from another project and how much we want to take on
ourselves.  For instance, it might well be possible to co-opt the Spark
community as part of our own ... their purpose was to support machine
learning, and joining forces might help achieve that on a broader scale than
previously envisioned.  GraphLab is an interesting case as well.

On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[email protected]> wrote:

> Hey gang,
>
>  Has anyone here played much with Giraph <http://incubator.apache.org/giraph/>
> (currently in the Apache Incubator)?  One of my co-workers ran it on our
> corporate Hadoop cluster this past weekend, and found it did a very fast
> PageRank computation (far faster than even well-tuned M/R code on the same
> data), and it worked pretty close to out-of-the-box.  Seems like that style
> of computation (in-memory distributed datasets), as used by Giraph (and the
> recently-discussed-on-this-list GraphLab <http://graphlab.org/>,
> Spark <http://www.spark-project.org/>,
> Twister <http://www.iterativemapreduce.org/>,
> Vowpal Wabbit <http://hunch.net/~vw/>, and probably a few others), is more
> and more the way to go for a lot of the things we want to do - scalable
> machine learning.  "RAM is the new Disk, and Disk is the new Tape" after
> all...
>
>  Giraph in particular seems nice, in that it runs on top of "old fashioned"
> Hadoop - it takes up (long-lived) Mapper slots on your regular cluster,
> spins up a ZooKeeper cluster if you don't supply the location of one, and
> is all in Java (which may be a minus for some people, I guess, but the
> alternatives are having to run some big exec'ed-out C++ code (GraphLab,
> VW), run on top of (admittedly awesome) Mesos (Spark [which, while running
> on the JVM, is also in Scala]), or run something with its own totally
> custom inter-server communication and data structures (Twister and many of
> the others)).
>
>  Seems we should not just be supportive of this kind of thing, but try to
> find some common ground and integration points.
>
>  -jake
>
