I was very surprised by this as well. I was doing variants on all-pairs shortest paths and found that the best representation really was triples containing from-node, to-node and distance. The nice side of this is that you get scaling like you wouldn't believe (subject to big-omega, of course)
On Thu, Oct 16, 2008 at 4:05 PM, Colin Evans <[EMAIL PROTECTED]> wrote: > The trick is to amortize your computation over the whole set. So DFS for a > single node will always be faster on an in-memory graph, but Hadoop is a > good tool for computing all-pairs shortest paths in one shot if you re-frame > the algorithm as a belief propagation and message passing algorithm. > > A lot of the time, the computation still explodes into n^2 or worse, so you > need to use a binning or blocking algorithm, like the one described here: > http://www.youtube.com/watch?v=1ZDybXl212Q > > In the case of graphs, a blocking function would be to find overlapping > strongly connected subgraphs where each subgraph fits in a reasonable amount > of memory. Then within each block, you do your computation and you pass a > summary of that computation to adjacent blocks,which gets factored into the > next computation. > > When we hooked up a Very Big Graph to our Hadoop cluster, we found that > there were a lot of scaling problems, which went away when we started > optimizing for streaming performance. > > -Colin > > > > > Bhupesh Bansal wrote: > >> Can you elaborate here , >> >> Lets say I want to implement a DFS in my graph. I am not able to picturise >> implementing it with doing graph in pieces without putting a depth bound >> to >> (3-4). Lets say we have 200M (4GB) edges to start with >> >> Best >> Bhupesh >> >> >> >> On 10/16/08 3:01 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote: >> >> >> >>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote: >>> >>> >>> >>>> We at Linkedin are trying to run some Large Graph Analysis problems on >>>> Hadoop. The fastest way to run would be to keep a copy of whole >>>> Graph in RAM >>>> at all mappers. (Graph size is about 8G in RAM) we have cluster of 8- >>>> cores >>>> machine with 8G on each. >>>> >>>> >>> The best way to deal with it is *not* to load the entire graph in one >>> process. In the WebMap at Yahoo, we have a graph of the web that has >>> roughly 1 trillion links and 100 billion nodes. See >>> http://tinyurl.com/4fgok6 >>> . To invert the links, you process the graph in pieces and resort >>> based on the target. You'll get much better performance and scale to >>> almost any size. >>> >>> >>> >>>> Whats is the best way of doing that ?? Is there a way so that multiple >>>> mappers on same machine can access a RAM cache ?? I read about hadoop >>>> distributed cache looks like it's copies the file (hdfs / http) >>>> locally on >>>> the slaves but not necessrily in RAM ?? >>>> >>>> >>> You could mmap the file from distributed cache using MappedByteBuffer. >>> Then there will be one copy between jvms... >>> >>> -- Owen >>> >>> >> >> >> > > -- ted
