I have been investigating giraph for a week now and I have a huge stability
problem. I am running the trunk against a CDH3u2 hadoop cluster. The problem
I am trying to solve goes as follows:
In a graph there are 3 kinds of vertices:
1: end of the world vertices, which have only 1 connected edge
2: bivalent vertices, which have exactly 2 connected edges
3: multivalent vertices that have n-edges connected (n>2)
The goal is now to calculate for each vertex that is not in category 2 all
the paths to the other reachable non bivalent vertices. One could say all
pathes between all non-bivalent vertices.
To give an example:
In this path I want to know that  forms a path to  via the edges <13>
and <11>.  forms a path via <12> with , via <10> with  and via
<11><13> with . You get the idea... Directionality is not important.
The algorithm I have is pretty straight forward, in superstep 0 all vertices
that are non-bivalent send a message to their neighbours via which edge they
are reachable. In all following supersteps the bivalent vertices are simply
forwarding this information and the non-bivalent ones are terminating the
algorithm. The messages that they sent are made using Textwritable instances
encoding the path.
If I run this algorithm on input with 1 million edges it never finishes, the
master process and then the others always go out of memory, even with a 10GB
heap. I know that java programs can be memory hungry, but 10GB heap for
processing a 40MB input file is a bit to much in book. I have tried all sorts
of settings, like disabling checkpoints, but nothing makes it finish. I also
see a slowdown in processing, the first 20ish supersteps are done in no time,
but then the processing slows down until it crashes in superstep 47.
My questions are: What am I doing wrong? Do you guys have any pointers for
things to look after? How can I get this thing to finish? Why is it so memory
Thanks a lot for your help!