Re: giraph stability problem

```I have seen people run into this problem when doing Graph Processing
directly on top of Hadoop too. This kind of an approach would take
exponential space.```
```
While the proposed solution would prevent the OOM problem, eventually one
would run out of disk space.

On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella <
claudio.marte...@gmail.com> wrote:

> Hi,
>
> I've been struggling with a similar problem and that's why i've
> started working on the out-of-core message management, when the memory
> shrinks. Your particular problem can get to an upper bound of
> exponential space complexity, which you experience with the OOM. the
> possible paths you can extract for each source vertex are about
> O(d^l), where d is the average degree of the graph and l is the max
> length of the extracted path (if you do shortest paths, then it's the
> diameter).
>
> Giraph is all good and fast but it's all in-memory and for this reason
> it currently lacks a solution to your problem. I suggest you wait
> until GIRAPH-45 is ready (I should write an email tonight about that
> patch).
>
> Hope it makes sense to you,
> Claudio
>
> On Mon, Jan 23, 2012 at 10:56 AM, André Kelpe
> > Hi list,
> >
> > I have been investigating giraph for a week now and I have a huge
> stability
> > problem. I am running the trunk against a CDH3u2 hadoop cluster. The
> problem
> > I am trying to solve goes as follows:
> >
> > In a graph there are 3 kinds of vertices:
> >
> > 1: end of the world vertices, which have only 1 connected edge
> > 2: bivalent vertices, which have exactly 2 connected edges
> > 3: multivalent vertices that have n-edges connected (n>2)
> >
> > The goal is now to calculate for each vertex that is not in category 2
> all
> > the paths to the other reachable non bivalent vertices. One could say all
> > pathes between all non-bivalent vertices.
> >
> > To give an example:
> >
> >                                        [9]
> >                                         |
> >                                         |
> >                                        <12>
> >                                         |
> >                                         |
> >      [5]-----<13>-----[6]-----<11>-----[7]-----<10>-----[8]
> >
> > In this path I want to know that [5] forms a path to [7] via the edges
> <13>
> > and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via
> > <11><13> with [5]. You get the idea... Directionality is not important.
> >
> > The algorithm I have is pretty straight forward, in superstep 0 all
> vertices
> > that are non-bivalent send a message to their neighbours via which edge
> they
> > are reachable. In all following supersteps the bivalent vertices are
> simply
> > forwarding this information and the non-bivalent ones are terminating the
> > algorithm. The messages that they sent are made using Textwritable
> instances
> > encoding the path.
> >
> > If I run this algorithm on input with 1 million edges it never finishes,
> the
> > master process and then the others always go out of memory, even with a
> 10GB
> > heap. I know that java programs can be memory hungry, but 10GB heap for
> > processing a 40MB input file is a bit to much in book. I have tried all
> sorts
> > of settings, like disabling checkpoints, but nothing makes it finish. I
> also
> > see a slowdown in processing, the first 20ish supersteps are done in no
> time,
> > but then the processing slows down until it crashes in superstep 47.
> >
> > My questions are: What am I doing wrong? Do you guys have any pointers
> for
> > things to look after? How can I get this thing to finish? Why is it so
> memory
> > hungry?
> >
> > Thanks a lot for your help!
> >
> > -André
>
>
>
> --
>    Claudio Martella
>    claudio.marte...@gmail.com
>
```