Ted,

  As usual, you're thinking the same way I am, it seems.

On Mon, Sep 5, 2011 at 12:07 AM, Ted Dunning <[email protected]> wrote:

> I think that alternatives to traditional map-reduce are really important
> for
> ML.  If you think about it, it is now common for 30-40 machine clusters to
> have a terabyte of RAM and it is actually slightly unusual to have ML
> datasets that large.  As such, BSP-like compute models have a lot to
> recommend them.
>

  Yep - Big Data seems to me to fall into a hierarchy these days:

  Sensor data >> Log data >> User-Generated data, Graph data >> User data,
Text corpora

  The first of these lives in the mother of all Big Data domains, hard
science (think: CERN, and time-series spatial weather data), and most
compute clusters already have more RAM than anything except the first two.
Sure, web logs don't fit into distributed RAM, but even FB and Twitter's
entire social graphs will fit into the main memory of a *very small*
cluster.  I don't know of many text corpora which don't fit into a TB or
two of RAM (unless you're talking web-wide, or "all Tweets across time").

> Also, there are implementations of BSP or pregel that allow the computation
> to exceed memory by requiring all nodes to be serializable.  Only the
> highly
> active ones are kept in memory.  This allows a graceful transition from
> complete memory-residency to something more like traditional hadoop-style
> map-reduce.
>

   Yeah, when this is done right (HARD!), it basically gives us the
universally scalable kind of computation we want: truly in-memory when
possible, but not dependent on being so.
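
   Just to sketch what I mean (purely a toy - none of these names are from
Giraph or any real framework), the trick is basically a vertex store that
keeps the hot vertices live and parks the rest as serialized bytes:

  // Conceptual sketch only - made-up names, nothing from an actual BSP
  // framework.  Hot vertices stay as live objects; everything else is
  // parked as serialized bytes (in practice, spilled to disk).
  import java.io._
  import scala.collection.mutable

  class SpillableVertexStore[V <: Serializable](maxInMemory: Int) {
    private val hot  = mutable.LinkedHashMap[Long, V]()   // live objects
    private val cold = mutable.Map[Long, Array[Byte]]()   // serialized, "spilled"

    def put(id: Long, v: V): Unit = {
      hot.put(id, v)
      if (hot.size > maxInMemory) {          // evict the oldest-inserted vertex
        val (oldId, oldV) = hot.head
        hot.remove(oldId)
        cold.put(oldId, serialize(oldV))
      }
    }

    def get(id: Long): Option[V] =
      hot.get(id).orElse(cold.remove(id).map { bytes =>
        val v = deserialize(bytes)
        put(id, v)                           // promote back to the hot set
        v
      })

    private def serialize(v: V): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(v); oos.close()
      bos.toByteArray
    }

    private def deserialize(bytes: Array[Byte]): V =
      new ObjectInputStream(new ByteArrayInputStream(bytes))
        .readObject().asInstanceOf[V]
  }

Swap the cold map for files on local disk and you get exactly the graceful
degradation Ted is describing.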


> Another interesting model is that of Spark.  I can well imagine that much
> of
> what we do could be replaced with very small Spark programs which could be
> composed much more easily than our current map-reduce command-line stuff
> could be glued together.  Lots of codes should experience two orders of
> magnitude speedup from the use of these alternative systems.
>

  This is my impression too.  The more I play with Spark, the more it
looks like "the Right Paradigm" for this kind of computation: how many
years have I been complaining that all I've ever wanted from Hadoop (and/or
Mahout) is to be able to say something like:

  vectors = load("hdfs://mydataFile");
  vectors.map(new Function<Vector, Vector>() {
            Vector apply(Vector in) { return in.normalize(1); }
          })
         .filter(new Predicate<Vector>() {
            boolean apply(Vector in) { return in.numNonDefaultValues() < 1000; }
          })
         .reduce(new Function<Pair<Vector, Vector>, Vector>() {
            Vector apply(Pair<Vector, Vector> pair) {
              return pair.getFirst().plus(pair.getSecond());
            }
          });

  Spark lets you do exactly this kind of strongly typed mixed OO+functional
thinking
(except without the verbosity of Java, and with proper closures).
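
  Just to make that concrete: here's roughly what that pipeline looks like
against Spark's Scala API as I remember it (a sketch, not tested code - the
old pre-Apache "spark" package, with raw Array[Double] standing in for
Mahout Vectors):

  import spark.SparkContext

  val sc = new SparkContext("local[4]", "vector-pipeline")

  val sum = sc.textFile("hdfs://mydataFile")
              .map(_.split("\\s+").map(_.toDouble))                    // line -> "vector"
              .map(v => { val n = v.map(math.abs).sum; v.map(_ / n) }) // L1-normalize
              .filter(_.count(_ != 0.0) < 1000)                        // keep sparse-ish rows
              .reduce((a, b) => a.zip(b).map { case (x, y) => x + y }) // element-wise sum

Three closures, no anonymous inner classes, and it reads like the math.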


> The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should
> actually liberate us from the tyranny of a single compute model, but it
> does
> bring us to the point of having to decide how much of this sort of thing we
> want to depend on from another project or how much we want to take on
> ourselves.


  Yes, this is a tricky part:

    Spark is awesome, but
      a) scala (good thing?), and
      b) dependent on Mesos (awesome thing, but not widely available/adopted
yet)

    Giraph is closer to our dependency-set:
      a) runs on raw Hadoop 0.20.2 and 0.20.203
      b) in java
      c) and is very similar to the usual programming model of M/R

  Point c) is pretty important: while you do get everything in-memory, you
still program it in a way that is familiar to Hadoop people - you have Job
objects you can configure to fit your cluster's particular shape and
characteristics, etc.

  Anything YARN-based kinda worries me.  I am not sure how soon people
will really see production-grade next-gen M/R environments available to
them.

> For instance, it might well be possible to co-opt the Spark
> community as part of our own ... their purpose was to support machine
> learning and joining forces might help achieve that on a broader scale than
> previously envisioned.  Graphlab is an interesting case as well.
>

  So there could be several steps in here.  Some people are never, ever
going to move to Scala (I might not be one of those people, but in another
flip of the quantum bit that triggered some neural process, I could have
been), and some may never get access to Mesos (so sad for them - it really
is pretty magical).  So while I fully believe that finding a way to work
with the Spark folk / adopt them as our own / start dating them is a good
idea, I think Giraph is of more immediate help:

  Giraph is
    a) already mavenized, and
    b) an Apache Incubator java project

We could modify a lot of our current algorithms to run on the exact same
clusters they're currently running on, with a few more jars (which have
already been checked for ASF compliance), be tons faster, and start
"thinking like a vertex" (although since everything for me is really a
matrix, I'll be "thinking like a vector which is a row of a matrix", which
is exactly the same thing).
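
  To make that last point concrete (again, a toy sketch - this is NOT
Giraph's actual Vertex API, which is java and shaped differently), one
PageRank superstep "as a vertex" is literally just vertex i computing row i
of r_new = 0.15/N + 0.85 * M * r_old from the messages its neighbors sent:

  // Toy model of the Pregel/Giraph style - made-up names, not the real API.
  class PRVertex(val id: Long, var value: Double, val outEdges: Seq[Long]) {
    var halted = false

    def compute(messages: Seq[Double], superstep: Long, numVertices: Long,
                send: (Long, Double) => Unit): Unit = {
      if (superstep > 0) {
        value = 0.15 / numVertices + 0.85 * messages.sum  // row i of M * r_old, plus damping
      }
      if (superstep < 30) {
        outEdges.foreach(dst => send(dst, value / outEdges.size))  // scatter my contribution
      } else {
        halted = true                                     // "vote to halt"
      }
    }
  }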

So maybe it's a multi-step process: start integrating, depending on, or
absorbing Giraph wholesale *soon*, help people write their adapters for
things like GraphLab in parallel, and learn from Spark to see how much of
our work could be replaced with it in the longer term (most likely Spark
and YARN will not stay so far apart in the long term either).

  -jake


> On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[email protected]>
> wrote:
>
> > Hey gang,
> >
> >  Has anyone here played much with
> > Giraph<http://incubator.apache.org/giraph/>(currently now in the
> > Apache Incubator)?  One of my co-workers ran it on our
> > corporate Hadoop cluster this past weekend, and found it did a very fast
> > PageRank computation (far faster than even well-tuned M/R code on the
> same
> > data), and it worked pretty close to out-of-the box.  Seems like that
> style
> > of computation (in-memory distributed datasets), as used by Giraph (and
> the
> > recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and
> > Spark<http://www.spark-project.org/>, and
> > Twister <http://www.iterativemapreduce.org/>, and Vowpal
> > Wabbit<http://hunch.net/~vw/>,
> > and probably a few others) is more and more the way to go for a lot of
> the
> > things we want to do - scalable machine learning.  "RAM is the new Disk,
> > and
> > Disk is the new Tape" after all...
> >
> >  Giraph in particular seems nice, in that it runs on top of "old
> fashioned"
> > Hadoop - it takes up (long-lived) Mapper slots on your regular cluster,
> > spins up a ZK cluster if you don't supply the location of one, and is all
> > in
> > java (which may be a minus, for some people, I guess, but having to run
> > some
> > big exec'ed out C++ code (GraphLab, VW), or run on-top of (admittedly
> > awesome) Mesos (Spark [which while running on the JVM, is also in
> Scala]),
> > or run its own totally custom inter-server communication and data
> > structures
> > (Twister and many of the others)).
> >
> >  Seems we should be not just supportive of this kind of thing, but try
> and
> > find some common ground and integration points.
> >
> >  -jake
> >
>
