Re: Distributed cache Design
I was very surprised by this as well. I was doing variants on all-pairs shortest paths and found that the best representation really was triples containing from-node, to-node, and distance. The nice side of this is that you get scaling like you wouldn't believe (subject to big-omega, of course).

On Thu, Oct 16, 2008 at 4:05 PM, Colin Evans [EMAIL PROTECTED] wrote:
> The trick is to amortize your computation over the whole set. So DFS for a single node will always be faster on an in-memory graph, but Hadoop is a good tool for computing all-pairs shortest paths in one shot if you re-frame the algorithm as a belief-propagation and message-passing algorithm.
>
> A lot of the time, the computation still explodes into n^2 or worse, so you need to use a binning or blocking algorithm, like the one described here: http://www.youtube.com/watch?v=1ZDybXl212Q
>
> In the case of graphs, a blocking function would be to find overlapping strongly connected subgraphs where each subgraph fits in a reasonable amount of memory. Then within each block, you do your computation and pass a summary of that computation to adjacent blocks, which gets factored into the next computation.
>
> When we hooked up a Very Big Graph to our Hadoop cluster, we found that there were a lot of scaling problems, which went away when we started optimizing for streaming performance.
>
> -Colin

-- ted
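The triple representation and the all-pairs framing discussed above can be illustrated with a small in-memory sketch. This is not Hadoop code; it is a hypothetical single-machine version of one relaxation round over (from, to, distance) triples, where known paths a->b and b->c are joined into a candidate a->c and the minimum distance is kept. Class and method names are made up for illustration; a real map/reduce version would do the join by shuffling triples on the shared node.

```java
// Illustrative sketch (not Hadoop code): one relaxation round over
// (from, to, distance) triples, keyed here as "from,to" -> distance.
import java.util.HashMap;
import java.util.Map;

public class TripleRelax {
    // Joins (a,b,d1) with (b,c,d2) into a candidate (a,c,d1+d2),
    // keeping the minimum distance for each pair. Repeating this
    // round until nothing changes yields all-pairs shortest paths.
    public static Map<String, Integer> relax(Map<String, Integer> d) {
        Map<String, Integer> out = new HashMap<>(d);
        for (Map.Entry<String, Integer> e1 : d.entrySet()) {
            String[] ab = e1.getKey().split(",");
            for (Map.Entry<String, Integer> e2 : d.entrySet()) {
                String[] bc = e2.getKey().split(",");
                if (!ab[1].equals(bc[0])) continue; // need a->b then b->c
                String key = ab[0] + "," + bc[1];
                int cand = e1.getValue() + e2.getValue();
                out.merge(key, cand, Integer::min);
            }
        }
        return out;
    }
}
```

In a map/reduce formulation the inner join becomes the shuffle: map each triple under both endpoints, join in the reducer, and keep the minimum per pair; the n^2 blow-up Colin mentions is why blocking matters.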
Distributed cache Design
Hey guys,

We at LinkedIn are trying to run some large graph analysis problems on Hadoop. The fastest way to run would be to keep a copy of the whole graph in RAM at all mappers. (Graph size is about 8G in RAM.) We have a cluster of 8-core machines with 8G on each.

What is the best way of doing that? Is there a way so that multiple mappers on the same machine can access a RAM cache? I read about the Hadoop distributed cache; it looks like it copies the file (HDFS / HTTP) locally onto the slaves, but not necessarily into RAM?

Best
Bhupesh
Re: Distributed cache Design
Minor correction: the graph size is about 6G, not 8G.

On 10/16/08 1:52 PM, Bhupesh Bansal [EMAIL PROTECTED] wrote:
> (Graph size is about 8G in RAM.)
Re: Distributed cache Design
At Freebase, we're mapping our large graphs into very large files of triples in HDFS and running large queries over them. Hadoop is optimized for processing streaming data off of disk, and we've found that trying to load a multi-GB graph and then access it in a Hadoop task has scaling problems. Mapping the graph to an on-disk representation as a set of interlocking or overlapping subgraphs works very well.

-Colin
Re: Distributed cache Design
Bhupesh Bansal wrote:
> Minor correction: the graph size is about 6G, not 8G.

Ah, that's better. With the JVM reuse feature in 0.19 you should be able to load it once per job into a static, since all tasks of that job can share a JVM. Things will get tight if you try to run two such jobs at once, since JVMs are only shared by a single job.

https://issues.apache.org/jira/browse/HADOOP-249

Doug
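Doug's load-once-per-JVM idea can be sketched as a lazily initialized static. This is a hypothetical illustration: `GraphHolder` and its stub loader are made up, and the assumption is that JVM reuse (the 0.19 `mapred.job.reuse.jvm.num.tasks` setting) lets successive tasks of one job see the same static state.

```java
// Hypothetical sketch: with JVM reuse, tasks of one job share a JVM,
// so an expensive structure can be loaded once into a static field.
import java.util.HashMap;
import java.util.Map;

public class GraphHolder {
    private static Map<Integer, int[]> graph; // node -> adjacency list

    // First task in this JVM pays the load cost; later tasks reuse it.
    public static synchronized Map<Integer, int[]> get(String localPath) {
        if (graph == null) {
            graph = load(localPath);
        }
        return graph;
    }

    // Placeholder loader; a real job would parse the graph file that the
    // distributed cache placed at localPath.
    static Map<Integer, int[]> load(String localPath) {
        Map<Integer, int[]> g = new HashMap<>();
        g.put(0, new int[]{1, 2}); // stub data for illustration
        return g;
    }
}
```

As Doug notes, this only helps within a single job: a second concurrent job gets its own JVMs and therefore its own copy.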
Re: Distributed cache Design
On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
> We at LinkedIn are trying to run some large graph analysis problems on Hadoop. The fastest way to run would be to keep a copy of the whole graph in RAM at all mappers. (Graph size is about 8G in RAM.) We have a cluster of 8-core machines with 8G on each.

The best way to deal with it is *not* to load the entire graph in one process. In the WebMap at Yahoo, we have a graph of the web that has roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6 . To invert the links, you process the graph in pieces and re-sort based on the target. You'll get much better performance and scale to almost any size.

> What is the best way of doing that? Is there a way so that multiple mappers on the same machine can access a RAM cache? I read about the Hadoop distributed cache; it looks like it copies the file (HDFS / HTTP) locally onto the slaves, but not necessarily into RAM?

You could mmap the file from distributed cache using MappedByteBuffer. Then there will be one copy between JVMs...

-- Owen
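Owen's mmap suggestion can be sketched as follows. The `GraphMap` class and `mapGraph` helper are hypothetical names; the assumption is that the graph file has already been localized by the distributed cache, so the task only needs to memory-map the local copy. Because the mapping is read-only, the OS page cache can back every JVM on the slave with the same physical pages.

```java
// Hypothetical sketch: mmap a locally cached graph file so that all
// task JVMs on a slave share one physical copy via the OS page cache.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class GraphMap {
    // Map the local cache copy read-only. The mapping remains valid
    // after the channel is closed.
    public static MappedByteBuffer mapGraph(String localPath) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(localPath, "r");
        try {
            FileChannel ch = raf.getChannel();
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. a path obtained from the distributed cache's local files
        MappedByteBuffer graph = mapGraph(args[0]);
        System.out.println("mapped " + graph.capacity() + " bytes");
    }
}
```

The trade-off versus Doug's static-field approach is that mmap shares memory across JVMs (and jobs), at the cost of working with a flat byte buffer rather than parsed Java objects.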
Re: Distributed cache Design
Can you elaborate here? Let's say I want to implement a DFS in my graph. I am not able to picture implementing it with the graph in pieces without putting a depth bound of 3-4 on it. Let's say we have 200M (4GB) edges to start with.

Best
Bhupesh

On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
> The best way to deal with it is *not* to load the entire graph in one process. In the WebMap at Yahoo, we have a graph of the web that has roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6 . To invert the links, you process the graph in pieces and re-sort based on the target. You'll get much better performance and scale to almost any size.
Re: Distributed cache Design
On Oct 16, 2008, at 3:09 PM, Bhupesh Bansal wrote:
> Let's say I want to implement a DFS in my graph. I am not able to picture implementing it with the graph in pieces without putting a depth bound of 3-4 on it. Let's say we have 200M (4GB) edges to start with.

Start by watching the lecture on graph algorithms in map/reduce: http://www.youtube.com/watch?v=BT-piFBP4fE

And see if that makes it clearer. If not, ask more questions. *smile*

-- Owen
Re: Distributed cache Design
Thanks Colin / Owen. I will try some of the ideas here and report back.

Best
Bhupesh