There's not much you can do If you're trying an algorithm with exponential space complexity and you don't have enough disk space. What do you suggest?

## Advertising

On Mon, Jan 23, 2012 at 4:13 PM, Deepak Nettem <deepaknet...@gmail.com> wrote: > I have seen people run into this problem when doing Graph Processing > directly on top of Hadoop too. This kind of an approach would take > exponential space. > > While the proposed solution would prevent the OOM problem, eventually one > would run out of disk space. > > > On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella > <claudio.marte...@gmail.com> wrote: >> >> Hi, >> >> I've been struggling with a similar problem and that's why i've >> started working on the out-of-core message management, when the memory >> shrinks. Your particular problem can get to an upper bound of >> exponential space complexity, which you experience with the OOM. the >> possible paths you can extract for each source vertex are about >> O(d^l), where d is the average degree of the graph and l is the max >> length of the extracted path (if you do shortest paths, then it's the >> diameter). >> >> Giraph is all good and fast but it's all in-memory and for this reason >> it currently lacks a solution to your problem. I suggest you wait >> until GIRAPH-45 is ready (I should write an email tonight about that >> patch). >> >> Hope it makes sense to you, >> Claudio >> >> On Mon, Jan 23, 2012 at 10:56 AM, André Kelpe >> <efeshundert...@googlemail.com> wrote: >> > Hi list, >> > >> > I have been investigating giraph for a week now and I have a huge >> > stability >> > problem. I am running the trunk against a CDH3u2 hadoop cluster. The >> > problem >> > I am trying to solve goes as follows: >> > >> > In a graph there are 3 kinds of vertices: >> > >> > 1: end of the world vertices, which have only 1 connected edge >> > 2: bivalent vertices, which have exactly 2 connected edges >> > 3: multivalent vertices that have n-edges connected (n>2) >> > >> > The goal is now to calculate for each vertex that is not in category 2 >> > all >> > the paths to the other reachable non bivalent vertices. One could say >> > all >> > pathes between all non-bivalent vertices. >> > >> > To give an example: >> > >> > [9] >> > | >> > | >> > <12> >> > | >> > | >> > [5]-----<13>-----[6]-----<11>-----[7]-----<10>-----[8] >> > >> > In this path I want to know that [5] forms a path to [7] via the edges >> > <13> >> > and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via >> > <11><13> with [5]. You get the idea... Directionality is not important. >> > >> > The algorithm I have is pretty straight forward, in superstep 0 all >> > vertices >> > that are non-bivalent send a message to their neighbours via which edge >> > they >> > are reachable. In all following supersteps the bivalent vertices are >> > simply >> > forwarding this information and the non-bivalent ones are terminating >> > the >> > algorithm. The messages that they sent are made using Textwritable >> > instances >> > encoding the path. >> > >> > If I run this algorithm on input with 1 million edges it never finishes, >> > the >> > master process and then the others always go out of memory, even with a >> > 10GB >> > heap. I know that java programs can be memory hungry, but 10GB heap for >> > processing a 40MB input file is a bit to much in book. I have tried all >> > sorts >> > of settings, like disabling checkpoints, but nothing makes it finish. I >> > also >> > see a slowdown in processing, the first 20ish supersteps are done in no >> > time, >> > but then the processing slows down until it crashes in superstep 47. >> > >> > My questions are: What am I doing wrong? Do you guys have any pointers >> > for >> > things to look after? How can I get this thing to finish? Why is it so >> > memory >> > hungry? >> > >> > Thanks a lot for your help! >> > >> > -André >> >> >> >> -- >> Claudio Martella >> claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com