Hey gang,

Has anyone here played much with Giraph <http://incubator.apache.org/giraph/> (currently in the Apache Incubator)? One of my co-workers ran it on our corporate Hadoop cluster this past weekend, and found it did a very fast PageRank computation (far faster than even well-tuned M/R code on the same data), and it worked pretty close to out of the box. Seems like that style of computation (in-memory distributed datasets), as used by Giraph (and the recently-discussed-on-this-list GraphLab <http://graphlab.org/>, Spark <http://www.spark-project.org/>, Twister <http://www.iterativemapreduce.org/>, Vowpal Wabbit <http://hunch.net/~vw/>, and probably a few others), is more and more the way to go for a lot of the things we want to do - scalable machine learning. "RAM is the new Disk, and Disk is the new Tape" after all...
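For anyone who hasn't seen the model: it's Pregel-style BSP - the graph gets partitioned across the workers' RAM, and on each superstep every vertex folds the messages sent to it and emits new ones, so you never re-read the graph off HDFS between iterations the way chained M/R jobs do. Here's a toy single-machine sketch of PageRank in that style (to be clear, this is NOT Giraph's actual API - the graph, superstep count, and damping factor here are all made up):

    import java.util.HashMap;
    import java.util.Map;

    // Toy single-machine sketch of the Pregel-style BSP model Giraph uses.
    // Not Giraph's API; the graph and all constants are invented for illustration.
    public class PageRankSketch {
        public static void main(String[] args) {
            // Adjacency lists: vertex id -> out-neighbors (a made-up 4-node graph).
            Map<Integer, int[]> edges = new HashMap<Integer, int[]>();
            edges.put(0, new int[] {1, 2});
            edges.put(1, new int[] {2});
            edges.put(2, new int[] {0});
            edges.put(3, new int[] {0, 2});

            int n = edges.size();
            Map<Integer, Double> rank = new HashMap<Integer, Double>();
            for (int v : edges.keySet()) {
                rank.put(v, 1.0 / n);
            }

            // Fixed number of supersteps; the real thing halts when every vertex
            // votes to halt and no messages are in flight.
            for (int superstep = 0; superstep < 30; superstep++) {
                // Messages to be delivered at the start of the next superstep.
                Map<Integer, Double> incoming = new HashMap<Integer, Double>();
                for (Map.Entry<Integer, int[]> e : edges.entrySet()) {
                    double share = rank.get(e.getKey()) / e.getValue().length;
                    for (int dst : e.getValue()) {
                        Double cur = incoming.get(dst);
                        incoming.put(dst, (cur == null ? 0.0 : cur) + share);
                    }
                }
                // The per-vertex compute(): teleport term plus damped message sum.
                for (int v : edges.keySet()) {
                    Double sum = incoming.get(v);
                    rank.put(v, 0.15 / n + 0.85 * (sum == null ? 0.0 : sum));
                }
            }
            System.out.println(rank);  // ranks sum to ~1.0
        }
    }

The thing to notice is that rank and edges just sit in memory across supersteps - that's where the win over iterated M/R comes from, since M/R would serialize the whole graph back to disk after every iteration.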
Giraph in particular seems nice, in that it runs on top of "old fashioned" Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, spins up a ZK cluster if you don't supply the location of one, and is all in Java. That may be a minus for some people, I guess, but the alternatives are shelling out to some big exec'ed C++ code (GraphLab, VW), running on top of the (admittedly awesome) Mesos (Spark - which, while it runs on the JVM, is also in Scala), or maintaining totally custom inter-server communication and data structures (Twister and many of the others). Seems we should be not just supportive of this kind of thing, but try to find some common ground and integration points.
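For the curious, driving a job looks roughly like this - I'm going from memory of the examples in the incubator tree, so exact package paths and method signatures may have shifted, and MyPageRankVertex plus the two format classes are placeholders you'd write yourself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.giraph.graph.GiraphJob;  // package path as of the incubator tree, I believe

    public class PageRankDriver {
        public static void main(String[] args) throws Exception {
            GiraphJob job = new GiraphJob(new Configuration(), "PageRank");
            // Placeholders: you supply a Vertex subclass with a compute() method,
            // plus input/output formats mapping your HDFS data to/from vertices.
            job.setVertexClass(MyPageRankVertex.class);
            job.setVertexInputFormatClass(MyVertexInputFormat.class);
            job.setVertexOutputFormatClass(MyVertexOutputFormat.class);
            // Point at an existing ZK ensemble; leave this out and Giraph
            // spins up its own quorum inside the job.
            job.setZooKeeperConfiguration("zk1:2181,zk2:2181,zk3:2181");
            // min/max workers and the fraction that must check in - these become
            // the long-lived mapper slots the job occupies on the cluster.
            job.setWorkerConfiguration(30, 30, 100.0f);
            job.run(true);  // submits as a map-only Hadoop job
        }
    }

No new daemons to install, which is the point: it's just another Hadoop job as far as the cluster is concerned.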
-jake