Hi list,

I have been investigating giraph for a week now and I have a huge stability
problem. I am running the trunk against a CDH3u2 hadoop cluster. The problem
I am trying to solve goes as follows:

In a graph there are 3 kinds of vertices:

1: end of the world vertices, which have only 1 connected edge
2: bivalent vertices, which have exactly 2 connected edges
3: multivalent vertices that have n-edges connected (n>2)

The goal is now to calculate for each vertex that is not in category 2 all
the paths to the other reachable non bivalent vertices. One could say all
pathes between all non-bivalent vertices.

To give an example:


In this path I want to know that [5] forms a path to [7] via the edges <13>
and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via
<11><13> with [5]. You get the idea... Directionality is not important.

The algorithm I have is pretty straight forward, in superstep 0 all vertices
that are non-bivalent send a message to their neighbours via which edge they
are reachable. In all following supersteps the bivalent vertices are simply
forwarding this information and the non-bivalent ones are terminating the
algorithm. The messages that they sent are made using Textwritable instances
encoding the path.

If I run this algorithm on input with 1 million edges it never finishes, the
master process and then the others always go out of memory, even with a 10GB
heap. I know that java programs can be memory hungry, but 10GB heap for
processing a 40MB input file is a bit to much in book. I have tried all sorts
of settings, like disabling checkpoints, but nothing makes it finish. I also
see a slowdown in processing, the first 20ish supersteps are done in no time,
but then the processing slows down until it crashes in superstep 47.

My questions are: What am I doing wrong? Do you guys have any pointers for
things to look after? How can I get this thing to finish? Why is it so memory

Thanks a lot for your help!


Reply via email to