I finished an excursion into Giraph's code and now I kinda know what it
takes to port Giraph over to run on top of YARN.
When the base Hadoop clusters are replaced by YARN clusters, Giraph will
have two options:
- *Giraph still works over MapReduce APIs*: Even after moving to YARN
clusters, Giraph can still run over MapReduceV2+YARN, without any code
changes at all.
- *Giraph works natively on YARN*: This can be done in such a way that in
the medium term, Giraph can continue to work on both a Hadoop MapReduce
cluster and a YARN cluster. Two visible effects that I can think of once
this effort gets underway:
-- There will be some moving around of classes/interfaces to separate
APIs from implementation details, and a bit of reorganisation of code to help
support both GiraphV1 and GiraphV2.
-- The port will probably also split the community's attention
(depending on how many of the community's eyeballs the new world grabs,
as opposed to the stabilization/feature work on GiraphV1).
Now here's the thing. Avery indicated on the other thread (about Giraph over
HAMA) that most of the users of Giraph need to work on top of a Hadoop
MapReduce cluster for quite some time. I completely agree with that, being
a long-time maintainer/supporting dev of Hadoop clusters myself.
Given that concern, before embarking on the port, I thought I'd get opinions
from the community.