Hey gang,

  Has anyone here played much with Giraph
(http://incubator.apache.org/giraph/), currently in the Apache Incubator?
One of my co-workers ran it on our corporate Hadoop cluster this past
weekend, and found it did a very fast PageRank computation (far faster than
even well-tuned M/R code on the same data), and it worked pretty close to
out-of-the-box.  Seems like that style of computation (in-memory distributed
datasets), as used by Giraph (and the recently-discussed-on-this-list
GraphLab (http://graphlab.org/), Spark (http://www.spark-project.org/),
Twister (http://www.iterativemapreduce.org/), Vowpal Wabbit
(http://hunch.net/~vw/), and probably a few others), is more and more the
way to go for a lot of the things we want to do - scalable machine
learning.  "RAM is the new Disk, and Disk is the new Tape," after all...
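
  For anyone who hasn't looked at it: Giraph's model is the vertex-centric
BSP style (a la Pregel) - each superstep, every vertex reads its incoming
messages, updates its value, and sends messages along its edges, with a
barrier between supersteps.  Here's a tiny single-machine sketch of PageRank
in that style, just to show the shape of the computation - the class names,
graph layout, and fixed 30-superstep cutoff are my own illustrative choices,
not Giraph's actual API:

```java
import java.util.*;

// Single-machine sketch of vertex-centric BSP PageRank.  Each "superstep"
// has two phases: every vertex sends rank/outDegree to its neighbors,
// then every vertex recomputes its rank from the messages it received.
// Assumes every vertex has at least one out-edge (no dangling-node
// handling, to keep the sketch short).
public class PageRankSketch {
    public static Map<Integer, Double> pageRank(Map<Integer, int[]> outEdges,
                                                int supersteps) {
        int n = outEdges.size();
        Map<Integer, Double> rank = new HashMap<>();
        for (Integer v : outEdges.keySet()) rank.put(v, 1.0 / n);
        for (int step = 0; step < supersteps; step++) {
            // Phase 1: each vertex sends its rank share along its edges.
            Map<Integer, Double> incoming = new HashMap<>();
            for (Map.Entry<Integer, int[]> e : outEdges.entrySet()) {
                double share = rank.get(e.getKey()) / e.getValue().length;
                for (int dst : e.getValue())
                    incoming.merge(dst, share, Double::sum);
            }
            // Phase 2 (after the superstep barrier): each vertex updates
            // from its received messages, with the usual 0.85 damping.
            for (Integer v : outEdges.keySet())
                rank.put(v, 0.15 / n + 0.85 * incoming.getOrDefault(v, 0.0));
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph: 1 -> {2,3}, 2 -> {3}, 3 -> {1}
        Map<Integer, int[]> g = new HashMap<>();
        g.put(1, new int[]{2, 3});
        g.put(2, new int[]{3});
        g.put(3, new int[]{1});
        Map<Integer, Double> r = pageRank(g, 30);
        System.out.printf("%.4f %.4f %.4f%n", r.get(1), r.get(2), r.get(3));
    }
}
```

The point is that the whole iteration state (the rank map) lives in memory
across supersteps, instead of being re-read from and re-written to HDFS on
every M/R iteration - which is where the speedup comes from.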

  Giraph in particular seems nice, in that it runs on top of "old fashioned"
Hadoop - it takes up (long-lived) Mapper slots on your regular cluster,
spins up a ZK cluster if you don't supply the location of one, and is all in
Java (which may be a minus for some people, I guess, but it beats having to
run some big exec'ed-out C++ binary (GraphLab, VW), run on top of
(admittedly awesome) Mesos (Spark - which, while running on the JVM, is
also in Scala), or run a totally custom inter-server communication layer
and set of data structures (Twister and many of the others)).

  Seems we should be not just supportive of this kind of thing, but try and
find some common ground and integration points.

  -jake
