On Fri, Sep 9, 2011 at 1:12 AM, Sebastian Schelter <[email protected]> wrote:

> On 05.09.2011 06:42, Jake Mannix wrote:
>
> > One of my co-workers ran it on our corporate Hadoop cluster this past
> > weekend, and found it did a very fast PageRank computation (far faster
> > than even well-tuned M/R code on the same data), and it worked pretty
> > close to out of the box.
>
> Could you share some more details? Which kind of implementation did he
> use? Power iterations over the adjacency matrix should be the fastest
> way to do this in M/R.
>

When you're in "pregel-land", you don't think about "matrices" anymore,
you think in graph terms (which is hard for me, because for me,
everything is a matrix), but the basic implementation in Giraph is power
iteration, yes.  But the point is that it's all in memory, all the time,
there's no "shuffle", and you're talking shard-to-shard via direct
(Hadoop RPC, currently) network connections.
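
To give a flavor of what that looks like, here's a sketch of the
vertex-centric power iteration, along the lines of Giraph's
SimplePageRankVertex example.  Treat it as illustrative rather than
exact: the package and method names approximate the current Vertex API
and may not match any particular release.

  import java.util.Iterator;

  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.giraph.graph.Vertex;  // package name is an assumption

  public class PageRankVertex extends
      Vertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final int MAX_SUPERSTEPS = 30;  // cf. -maxIter below

    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
      if (getSuperstep() >= 1) {
        // Sum the rank mass sent by in-neighbors in the last superstep.
        double sum = 0;
        while (msgIterator.hasNext()) {
          sum += msgIterator.next().get();
        }
        // Standard damped PageRank update.
        setVertexValue(
            new DoubleWritable(0.15 / getNumVertices() + 0.85 * sum));
      }
      if (getSuperstep() < MAX_SUPERSTEPS) {
        // Split this vertex's rank evenly over its out-edges; messages
        // are delivered at the start of the next superstep.
        sendMsgToAllEdges(
            new DoubleWritable(getVertexValue().get() / getNumOutEdges()));
      } else {
        voteToHalt();
      }
    }
  }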


> I'm beginning to look at Giraph (and also taking a deeper look at the
> Pregel paper). I think the "Vertex" paradigm is much more intuitive and
> easier to use for implementing graph algorithms than plain MapReduce.
>

For graph algorithms, yes.  But that's not all.  Anything which fits in
the land of "BSP" computations can be done this way, and I've been
exploring relaxing that a bit as well (if shards can talk to each other
in bulk, once per "superstep", why not also allow vertices to
communicate asynchronously *during* a superstep?), and seeing what
further iterative algorithms are possible.
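
To make the "once per superstep" point concrete, here's a toy,
single-process sketch of the BSP loop itself (hypothetical Vertex and
Outbox interfaces, nothing to do with Giraph's actual API): messages
sent during superstep N only become visible at superstep N+1, and that
barrier is exactly what the asynchronous relaxation would loosen.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class BspSketch {
    // Hypothetical interfaces, for illustration only.
    interface Outbox { void send(int targetVertex, double message); }
    interface Vertex { void compute(List<Double> messages, Outbox out); }

    public static void run(List<Vertex> vertices, int maxSupersteps) {
      // Messages visible to each vertex in the current superstep.
      Map<Integer, List<Double>> inboxes = new HashMap<>();
      for (int step = 0; step < maxSupersteps; step++) {
        Map<Integer, List<Double>> nextInboxes = new HashMap<>();
        Outbox out = (target, msg) ->
            nextInboxes.computeIfAbsent(target, k -> new ArrayList<>())
                .add(msg);
        for (int i = 0; i < vertices.size(); i++) {
          vertices.get(i).compute(
              inboxes.getOrDefault(i, Collections.emptyList()), out);
        }
        // Superstep barrier: only now do sent messages become visible.
        inboxes = nextInboxes;
      }
    }
  }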


> So if Giraph is so much faster and at the same time easier to use, we
> should think about basing our graph algorithms (which are still very
> much in their infancy) on it, once a stable release exists.
>

Well, the constraint with Giraph is that everything must fit in
(distributed) memory.  But it's not much of a scalability bottleneck,
as many (non-logfile) datasets on reasonably sized clusters can indeed
fit in memory.

But yes, I think Giraph is the place to develop any graph algorithms we
have.  Not sure of the right integration points, however.  We could
depend on Giraph, but why not just implement graph-specific algorithms
*in Giraph*?  If it needs some of our math, why not just have Giraph
depend on *us*?


> As far as I understand it, Giraph runs as a standard M/R job in Hadoop,
> right? So there is no installation necessary on the cluster.
>

That's correct. Something schematically like this runs it:

  "hadoop jar giraph-with-dependencies-0.70.jar \
      org.apache.giraph.examples.SimplePageRankVertex \
      -i hdfs://mygraphinput -o hdfs://mypagerankoutput -maxIter 30"

And it runs on a vanilla Hadoop install, yes.

It's still really young, with not too many developers, but there's
actually a lot of code already (find . -name "*.java" | xargs wc -l =>
17462 in the dev branch I'm currently monkeying with).

  -jake
