Vinod, thanks for your comments. I've replied inline.
On 9/14/11 11:09 AM, Vinod Kumar Vavilapalli wrote:
I think the takeaway should be that our clients (at Yahoo! and
elsewhere) are currently using Giraph on MRv1. While the Giraph API is
not exposing the underlying infrastructure APIs (i.e. MRv1 and MRv2), we
still need to support the MRv1 implementation even while we
begin/complete the port to MRv2. I imagine that we will need to support
both MRv1 and MRv2 for a fairly long period of time, as the transition to
MRv2 for a company (e.g. Yahoo!) could take a very long time (i.e.
anywhere from 8 months to multiple years). Some of our internal
clusters at Yahoo! today are still running 0.20.1, for example.
Some replies inline to the issues you outlined.
1) Giraph runs completely as a MapReduce job on Hadoop today. This needs
to be maintained to support our current users, who will not likely move to
MRv2 for at least a year.
I think what you need is to support Giraph's graph API for your users, but
not the underlying implementation. (Or are you leaking MapReduce APIs to
your users?) Sure, you are restricted to the underlying implementation (Hadoop
MRV1 or MRV2, whenever it gets used) at any point in time, but what we are
discussing is _that_ future when the underlying implementation itself also
moves to MRV2.
In theory this is true. However, as mentioned previously, we still have
users on MRv1 and will need to support it for a long time (i.e. at least
a year, probably more). Also I'm fairly certain that during the next
year, we will have non-BSP based graph processing computing models in
place as well. For these reasons, it may not make sense to try to put
Giraph on top of HAMA even when we are both on MRv2. It's hard to say
now, as it is early. Let's revisit this at a later time.
2) The internals of Giraph are implemented differently than Hama..
Sure, but only at present. My original question is: given a BSP
implementation on a YARN cluster, can GiraphV2 (BSP based) simply be
implemented over that or not? If today GiraphV1 uses (its own) BSP
implementation over MapReduce APIs on a Hadoop MRV1 cluster, I can clearly see
how GiraphV2 could use (HAMA's) BSP implemented over YARN APIs.
Giraph does appear to be moving at a high velocity currently, but we
have a clear intention to run on top of MRv2. Please see
https://issues.apache.org/jira/browse/GIRAPH-13. Obviously, the MRv2
changes are much better suited for Giraph and we look forward to the day
when nearly all Hadoop instances are running MRv2.
3) If we have various graph processing computing models (BSP based,
streams or asynchronous, or a combination), then being on Hama brings little
value for Giraph.
That future isn't there yet. In any case, I'd bet when you get there, a lot of
what you have now also wouldn't be an out-of-the-box fit.
From my perspective (a third person POV), this is what I can conclude.
Giraph's velocity on Hadoop MapReduce may be the real impediment to thinking
about a possible sharing of the BSP-based implementation with HAMAV2. Sure,
Giraph has other ideas regarding the computation model itself, but that is a
future that isn't here yet.
I just hope the same velocity isn't an impediment to thinking about the
next-gen version on top of YARN :) The way I see it, porting Giraph to YARN
is also a revolution in itself; most, if not all, of the implementation will
change, yet API-level compatibility will remain. I am still eagerly looking
forward to the port of Giraph to YARN. Maybe more digging into Giraph
internals will help my cause too.
If nothing else, this discussion at least helped share some ideas
between the two communities.
Thanks all for putting down your thoughts.
On Wed, Sep 14, 2011 at 11:46 AM, Thomas Jungblut wrote:
We are also thinking about other underlying computing models (i.e.
streaming (asynchronous) graph processing - see
That is a really cool idea. But I don't think we are going to focus solely
on graph computing. We want to enable an interface which can be used for it
(straightforward, as described in the Pregel paper), but I think you are
really the graph experts, so we don't want to compete with each other :D
Our asynchronous processing (in my opinion) will just enable the sending of
messages within the computation phase. So the BarrierSync is just a little
transition to make sure every task is ready and every message has been sent.
Your Vertex locking is a graph-only feature, so this won't be affecting us.
Giraph runs completely as a MapReduce job on Hadoop today.
I think our result is the following:
We (Apache Hama) are focussing on the YARN implementation of the BSP
If you want to run Giraph on a real BSP engine later, feel free to put your
stuff on top of that.
As far as I have seen, YARN is 100% backward compatible, so
your current solution will run on YARN as well.