Re: Port to YARN: GIRAPH and HAMA

Avery Ching Wed, 14 Sep 2011 14:46:10 -0700

Vinod, thanks for your comments.  I've replied inline.


Avery

On 9/14/11 11:09 AM, Vinod Kumar Vavilapalli wrote:

Avery,

Some replies inline to the issues you outlined.

1)  Giraph runs completely as a MapReduce job on Hadoop today.  This needs
to be maintained to support our current users, who will not likely move to
MRv2 for at least a year.
I think what you need is to support Giraph's graph API for your users, but
not the underlying implementation. (Or are you leaking MapReduce APIs to
your users?) Sure, you are restricted to the underlying implementation (Hadoop
MRV1 or MRV2, whenever it gets used) at any point in time, but what we are
discussing is _that_ future when the underlying implementation itself also
moves to MRV2.

I think the takeaway should be that our clients (at Yahoo! and elsewhere) are currently using Giraph on MRv1. While the Giraph API is not exposing the underlying infrastructure APIs (i.e. MRv1 and MRv2), we still need to support the MRv1 implementation even while we begin/complete the port to MRv2. I imagine that we will need to support both MRv1 and MRv2 for a fairly long period of time, as the transition to MRv2 for a company (e.g. Yahoo!) could take a very long time (anywhere from 8 months to multiple years). Some of our internal clusters at Yahoo! today are still running 0.20.1, for example.

2)  The internals of Giraph are implemented differently than Hama..

Sure, but only at present. My original question is: given a BSP
implementation on a YARN cluster, can GiraphV2 (BSP based) simply be
implemented over that or not? If today GiraphV1 uses (its own) BSP
implementation over MapReduce APIs on a Hadoop MRV1 cluster, I can clearly see
how GiraphV2 could use (HAMA's) BSP implemented over YARN APIs.

In theory this is true. However, as mentioned previously, we still have users on MRv1 and will need to support it for a long time (i.e. at least a year, probably more). Also, I'm fairly certain that during the next year we will have non-BSP based graph processing computing models in place as well. For these reasons, it may not make sense to try to put Giraph on top of HAMA even when we are both on MRv2. It's hard to say now as it is early. Let's revisit this at a later time.

3)  If we have various graph processing computing models (BSP based,
streams or asynchronous, or a combination), then being on Hama brings little
value for Giraph.
That future isn't there yet. In any case, I'd bet that when you get there, a lot of
what you have now also wouldn't be an out-of-the-box fit.

From my perspective (a third person POV), this is what I can conclude.
Giraph's velocity on Hadoop MapReduce may be the real impediment to thinking
about a possible sharing of the BSP based implementation with HAMAV2. Sure,
Giraph has other ideas regarding the computation model itself, but that is a
future that isn't here yet.

I just hope the same velocity isn't an impediment to thinking about the
next-gen version on top of YARN :) The way I see it, porting Giraph to YARN
is also a revolution in itself; most, if not all, of the implementation will
change, yet with API level compatibility. I am still eagerly looking
forward to the port of Giraph to YARN. Maybe more digging into Giraph
internals will help my cause too.

Giraph does appear to be moving fast currently, but we have a clear intention to run on top of MRv2. Please see https://issues.apache.org/jira/browse/GIRAPH-13. Obviously, the MRv2 changes are much better suited for Giraph and we look forward to the day when nearly all Hadoop instances are running MRv2.

If nothing else, this discussion at least helped share some of the ideas
between the two communities.

Thanks all for putting down your thoughts.
+Vinod


On Wed, Sep 14, 2011 at 11:46 AM, Thomas Jungblut <[email protected]> wrote:

We are also thinking about other underlying computing models (i.e.
streaming (asynchronous) graph processing - see


That is a really cool idea. But I don't think we are going to focus solely
on graph computing. We want to enable an interface which can be used for it
(straightforward, as described in the Pregel paper), but I think you are
really the graph experts, so we don't want to compete with each other :D
Our asynchronous processing (in my opinion) will just enable the sending of
messages within the computation phase. So the BarrierSync is just a little
transition to make sure every task is ready and every message has been sent.
Your Vertex locking is a graph-only feature; this won't be affecting us
anyway.


Giraph runs completely as a MapReduce job on Hadoop today.
All right.

I think our result is the following:
We (Apache Hama) are focussing on the YARN implementation of the BSP
paradigm.
If you want to run Giraph on a real BSP engine later, feel free to put your
stuff on top of that.
As far as I have seen, YARN is 100% backward compatible, so
your current solution will run on YARN as well.

Best Regards,

Thomas
