I have seen people run into this problem when doing Graph Processing directly on top of Hadoop too. This kind of an approach would take exponential space.

## Advertising

While the proposed solution would prevent the OOM problem, eventually one would run out of disk space. On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella < claudio.marte...@gmail.com> wrote: > Hi, > > I've been struggling with a similar problem and that's why i've > started working on the out-of-core message management, when the memory > shrinks. Your particular problem can get to an upper bound of > exponential space complexity, which you experience with the OOM. the > possible paths you can extract for each source vertex are about > O(d^l), where d is the average degree of the graph and l is the max > length of the extracted path (if you do shortest paths, then it's the > diameter). > > Giraph is all good and fast but it's all in-memory and for this reason > it currently lacks a solution to your problem. I suggest you wait > until GIRAPH-45 is ready (I should write an email tonight about that > patch). > > Hope it makes sense to you, > Claudio > > On Mon, Jan 23, 2012 at 10:56 AM, André Kelpe > <efeshundert...@googlemail.com> wrote: > > Hi list, > > > > I have been investigating giraph for a week now and I have a huge > stability > > problem. I am running the trunk against a CDH3u2 hadoop cluster. The > problem > > I am trying to solve goes as follows: > > > > In a graph there are 3 kinds of vertices: > > > > 1: end of the world vertices, which have only 1 connected edge > > 2: bivalent vertices, which have exactly 2 connected edges > > 3: multivalent vertices that have n-edges connected (n>2) > > > > The goal is now to calculate for each vertex that is not in category 2 > all > > the paths to the other reachable non bivalent vertices. One could say all > > pathes between all non-bivalent vertices. > > > > To give an example: > > > > [9] > > | > > | > > <12> > > | > > | > > [5]-----<13>-----[6]-----<11>-----[7]-----<10>-----[8] > > > > In this path I want to know that [5] forms a path to [7] via the edges > <13> > > and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via > > <11><13> with [5]. You get the idea... Directionality is not important. > > > > The algorithm I have is pretty straight forward, in superstep 0 all > vertices > > that are non-bivalent send a message to their neighbours via which edge > they > > are reachable. In all following supersteps the bivalent vertices are > simply > > forwarding this information and the non-bivalent ones are terminating the > > algorithm. The messages that they sent are made using Textwritable > instances > > encoding the path. > > > > If I run this algorithm on input with 1 million edges it never finishes, > the > > master process and then the others always go out of memory, even with a > 10GB > > heap. I know that java programs can be memory hungry, but 10GB heap for > > processing a 40MB input file is a bit to much in book. I have tried all > sorts > > of settings, like disabling checkpoints, but nothing makes it finish. I > also > > see a slowdown in processing, the first 20ish supersteps are done in no > time, > > but then the processing slows down until it crashes in superstep 47. > > > > My questions are: What am I doing wrong? Do you guys have any pointers > for > > things to look after? How can I get this thing to finish? Why is it so > memory > > hungry? > > > > Thanks a lot for your help! > > > > -André > > > > -- > Claudio Martella > claudio.marte...@gmail.com >