Re: Distributed cache Design

2008-10-20 Thread Ted Dunning
I was very surprised by this as well.  I was doing variants on all-pairs
shortest paths and found that the best representation really was triples
containing from-node, to-node, and distance. The nice side of this is that
you get scaling like you wouldn't believe (subject to the big-Omega lower
bound, of course).
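Ted's triple representation can be sketched in a few lines of plain Java: a relaxation pass that joins (from, to, distance) triples on the shared middle node and keeps the minimum distance per pair, which is the kind of step a map/reduce job would perform over the triple set. The class name and the `"a->b"` key format are illustrative, not from the thread.

```java
import java.util.*;

class TripleRelax {
    // One relaxation pass: join triples (a,b,d1) with (b,c,d2) to propose
    // (a,c,d1+d2), then keep the minimum distance per (from,to) pair.
    static Map<String, Integer> relax(Map<String, Integer> triples) {
        Map<String, Integer> out = new HashMap<>(triples);
        for (var e1 : triples.entrySet()) {
            String[] ab = e1.getKey().split("->");
            for (var e2 : triples.entrySet()) {
                String[] bc = e2.getKey().split("->");
                if (ab[1].equals(bc[0])) {
                    out.merge(ab[0] + "->" + bc[1],
                              e1.getValue() + e2.getValue(), Math::min);
                }
            }
        }
        return out;
    }
    public static void main(String[] args) {
        Map<String, Integer> t = Map.of("a->b", 1, "b->c", 2);
        // a->c is discovered with distance 3
        System.out.println(relax(t).get("a->c")); // prints 3
    }
}
```

A real job would iterate this pass until no triple improves; the in-memory nested loop here stands in for the join that the shuffle performs.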

On Thu, Oct 16, 2008 at 4:05 PM, Colin Evans [EMAIL PROTECTED] wrote:

 The trick is to amortize your computation over the whole set.  So DFS for a
 single node will always be faster on an in-memory graph, but Hadoop is a
 good tool for computing all-pairs shortest paths in one shot if you re-frame
 the algorithm as a belief propagation and message passing algorithm.
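One pass of the message-passing formulation Colin describes can be simulated in plain Java (not the real Hadoop API; names and the unit edge weight are illustrative): every node with a known distance "sends" dist + 1 to its neighbors in the map role, and each receiver keeps the minimum in the reduce role.

```java
import java.util.*;

class BfsIteration {
    // One map/reduce pass of single-source shortest paths by message passing.
    static Map<String, Integer> step(Map<String, List<String>> adj,
                                     Map<String, Integer> dist) {
        Map<String, Integer> next = new HashMap<>(dist);
        for (var e : dist.entrySet()) {                        // map: emit messages
            for (String nbr : adj.getOrDefault(e.getKey(), List.of())) {
                next.merge(nbr, e.getValue() + 1, Math::min);  // reduce: keep min
            }
        }
        return next;
    }
    public static void main(String[] args) {
        Map<String, List<String>> adj = Map.of("s", List.of("a"), "a", List.of("b"));
        Map<String, Integer> dist = Map.of("s", 0);
        dist = step(adj, dist);   // after pass 1: s=0, a=1
        dist = step(adj, dist);   // after pass 2: s=0, a=1, b=2
        System.out.println(dist.get("b")); // prints 2
    }
}
```

Iterating until no distance improves gives the full single-source answer; running it from every source gives all pairs.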

 A lot of the time, the computation still explodes into n^2 or worse, so you
 need to use a binning or blocking algorithm, like the one described here:
 http://www.youtube.com/watch?v=1ZDybXl212Q

 In the case of graphs, a blocking function would be to find overlapping
 strongly connected subgraphs where each subgraph fits in a reasonable amount
 of memory.  Then within each block, you do your computation and you pass a
 summary of that computation to adjacent blocks, which gets factored into the
 next computation.
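A hypothetical sketch of the summary-passing step: after a block has relaxed distances locally, only the distances of its boundary nodes are forwarded as the "summary" for neighboring blocks (the partitioning itself is assumed done elsewhere).

```java
import java.util.*;

class BlockSummary {
    // Extract the summary a block would send onward: the distances of the
    // nodes on its boundary with other blocks.
    static Map<String, Integer> summarize(Map<String, Integer> blockDist,
                                          Set<String> boundary) {
        Map<String, Integer> summary = new HashMap<>();
        for (var e : blockDist.entrySet())
            if (boundary.contains(e.getKey()))
                summary.put(e.getKey(), e.getValue());
        return summary;
    }
    public static void main(String[] args) {
        Map<String, Integer> blockDist = Map.of("a", 0, "b", 1, "c", 2);
        Set<String> boundary = Set.of("c");  // c borders the next block
        System.out.println(summarize(blockDist, boundary)); // prints {c=2}
    }
}
```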

 When we hooked up a Very Big Graph to our Hadoop cluster, we found that
 there were a lot of scaling problems, which went away when we started
 optimizing for streaming performance.

 -Colin




 Bhupesh Bansal wrote:

 Can you elaborate here?

 Let's say I want to implement a DFS on my graph. I am not able to picture
 implementing it with the graph in pieces without putting a depth bound on
 it (3-4). Let's say we have 200M edges (4GB) to start with.

 Best
 Bhupesh



 On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:



 On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:



 We at LinkedIn are trying to run some large graph analysis problems on
 Hadoop. The fastest way to run them would be to keep a copy of the whole
 graph in RAM at all mappers (the graph is about 8G in RAM). We have a
 cluster of 8-core machines with 8G on each.


 The best way to deal with it is *not* to load the entire graph in one
 process. In the WebMap at Yahoo, we have a graph of the web that has
 roughly 1 trillion links and 100 billion nodes. See
 http://tinyurl.com/4fgok6. To invert the links, you process the graph in
 pieces and re-sort based on the target. You'll get much better performance
 and scale to almost any size.



 What is the best way of doing that? Is there a way for multiple mappers
 on the same machine to access a RAM cache? I read about Hadoop's
 distributed cache; it looks like it copies the file (HDFS / HTTP) locally
 onto the slaves, but not necessarily into RAM.


 You could mmap the file from the distributed cache using MappedByteBuffer.
 Then there will be one copy shared between JVMs...

 -- Owen









-- 
ted


Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Hey guys, 


We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run them would be to keep a copy of the whole
graph in RAM at all mappers (the graph is about 8G in RAM). We have a
cluster of 8-core machines with 8G on each.

What is the best way of doing that? Is there a way for multiple mappers on
the same machine to access a RAM cache? I read about Hadoop's distributed
cache; it looks like it copies the file (HDFS / HTTP) locally onto the
slaves, but not necessarily into RAM.

Best
Bhupesh



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Minor correction: the graph is about 6G, not 8G.


On 10/16/08 1:52 PM, Bhupesh Bansal [EMAIL PROTECTED] wrote:

 Hey guys, 
 
 
 We at LinkedIn are trying to run some large graph analysis problems on
 Hadoop. The fastest way to run them would be to keep a copy of the whole
 graph in RAM at all mappers (the graph is about 8G in RAM). We have a
 cluster of 8-core machines with 8G on each.
 
 What is the best way of doing that? Is there a way for multiple mappers
 on the same machine to access a RAM cache? I read about Hadoop's
 distributed cache; it looks like it copies the file (HDFS / HTTP) locally
 onto the slaves, but not necessarily into RAM.
 
 Best
 Bhupesh
 



Re: Distributed cache Design

2008-10-16 Thread Colin Evans
At Freebase, we're mapping our large graphs into very large files of 
triples in HDFS and running large queries over them. 

Hadoop is optimized for processing streaming data off disk, and we've 
found that trying to load a multi-GB graph and then access it from a 
Hadoop task has scaling problems.  Mapping the graph to an on-disk 
representation as a bunch of interlocking or overlapping subgraphs works 
very well.
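A toy sketch of the streaming idea (assuming a tab-separated triple format pre-sorted by source node; plain Java, not an actual HDFS job): triples are folded into adjacency records in a single forward pass, with no random access into the graph.

```java
import java.util.*;

class StreamAdjacency {
    // Fold sorted "from \t to \t distance" lines into adjacency records.
    static Map<String, List<String>> fold(List<String> sortedTriples) {
        Map<String, List<String>> adj = new LinkedHashMap<>();
        for (String line : sortedTriples) {
            String[] p = line.split("\t");   // from, to, distance
            adj.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return adj;
    }
    public static void main(String[] args) {
        List<String> triples = List.of("a\tb\t1", "a\tc\t2", "b\tc\t1");
        System.out.println(fold(triples)); // prints {a=[b, c], b=[c]}
    }
}
```

Because the input is sorted, each adjacency record is complete as soon as the source node changes, so a real job could emit records incrementally instead of holding the whole map.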


-Colin


Bhupesh Bansal wrote:
Hey guys, 



We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run them would be to keep a copy of the whole
graph in RAM at all mappers (the graph is about 8G in RAM). We have a
cluster of 8-core machines with 8G on each.

What is the best way of doing that? Is there a way for multiple mappers on
the same machine to access a RAM cache? I read about Hadoop's distributed
cache; it looks like it copies the file (HDFS / HTTP) locally onto the
slaves, but not necessarily into RAM.

Best
Bhupesh

  




Re: Distributed cache Design

2008-10-16 Thread Doug Cutting

Bhupesh Bansal wrote:

Minor correction: the graph is about 6G, not 8G.


Ah, that's better.

With the jvm reuse feature in 0.19 you should be able to load it once 
per job into a static, since all tasks of that job can share a JVM. 
Things will get tight if you try to run two such jobs at once, since 
JVMs are only shared by a single job.


https://issues.apache.org/jira/browse/HADOOP-249
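Doug's suggestion could be sketched as a static holder (illustrative only; the loading code is a stand-in for reading the real graph). With JVM reuse, later tasks of the same job run in the same JVM and hit the cached static instead of reloading.

```java
import java.util.*;

class GraphHolder {
    private static Map<String, int[]> graph;   // loaded at most once per JVM
    static int loads = 0;                      // counter for illustration only

    static synchronized Map<String, int[]> get() {
        if (graph == null) {
            loads++;
            graph = new HashMap<>();           // stand-in for the real load
            graph.put("a", new int[]{1, 2});
        }
        return graph;
    }
    public static void main(String[] args) {
        GraphHolder.get();   // first task in the JVM: loads the graph
        GraphHolder.get();   // later task, same JVM: cache hit
        System.out.println(loads); // prints 1
    }
}
```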

Doug




Re: Distributed cache Design

2008-10-16 Thread Owen O'Malley


On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:


We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run them would be to keep a copy of the whole
graph in RAM at all mappers (the graph is about 8G in RAM). We have a
cluster of 8-core machines with 8G on each.


The best way to deal with it is *not* to load the entire graph in one  
process. In the WebMap at Yahoo, we have a graph of the web that has  
roughly 1 trillion links and 100 billion nodes. See
http://tinyurl.com/4fgok6. To invert the links, you process the graph in
pieces and re-sort based on the target. You'll get much better performance
and scale to almost any size.
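The inversion step can be simulated in a few lines of plain Java (not the actual WebMap code): the map phase re-keys each (source, target) link as (target, source), and the shuffle's sort then groups, per target, every page linking to it, so no piece of the job ever needs the whole graph in memory.

```java
import java.util.*;

class InvertLinks {
    // Invert (source -> target) links into per-target in-link lists.
    static Map<String, List<String>> invert(List<String[]> links) {
        Map<String, List<String>> inlinks = new TreeMap<>();   // sorted, like the shuffle
        for (String[] link : links) {                          // map: emit (target, source)
            inlinks.computeIfAbsent(link[1], k -> new ArrayList<>()).add(link[0]);
        }
        return inlinks;                                        // reduce sees grouped in-links
    }
    public static void main(String[] args) {
        List<String[]> links = List.of(
            new String[]{"a", "c"}, new String[]{"b", "c"}, new String[]{"c", "a"});
        System.out.println(invert(links)); // prints {a=[c], c=[a, b]}
    }
}
```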



What is the best way of doing that? Is there a way for multiple mappers on
the same machine to access a RAM cache? I read about Hadoop's distributed
cache; it looks like it copies the file (HDFS / HTTP) locally onto the
slaves, but not necessarily into RAM.


You could mmap the file from the distributed cache using MappedByteBuffer.
Then there will be one copy shared between JVMs...
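A minimal sketch of the mmap approach (a temp file stands in for the distributed-cache local file): map it read-only with MappedByteBuffer, so JVMs mapping the same file share its pages through the OS page cache rather than each holding a private heap copy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapCache {
    static MappedByteBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    // Demo: write two bytes, map the file, read the second byte back.
    static int demo() {
        try {
            Path tmp = Files.createTempFile("graph", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[]{7, 42});
            return map(tmp).get(1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```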


-- Owen


Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Can you elaborate here?

Let's say I want to implement a DFS on my graph. I am not able to picture
implementing it with the graph in pieces without putting a depth bound on
it (3-4). Let's say we have 200M edges (4GB) to start with.

Best
Bhupesh



On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 
 On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
 
 We at LinkedIn are trying to run some large graph analysis problems on
 Hadoop. The fastest way to run them would be to keep a copy of the whole
 graph in RAM at all mappers (the graph is about 8G in RAM). We have a
 cluster of 8-core machines with 8G on each.
 
 The best way to deal with it is *not* to load the entire graph in one
 process. In the WebMap at Yahoo, we have a graph of the web that has
 roughly 1 trillion links and 100 billion nodes. See
 http://tinyurl.com/4fgok6. To invert the links, you process the graph in
 pieces and re-sort based on the target. You'll get much better performance
 and scale to almost any size.
 
 What is the best way of doing that? Is there a way for multiple mappers
 on the same machine to access a RAM cache? I read about Hadoop's
 distributed cache; it looks like it copies the file (HDFS / HTTP) locally
 onto the slaves, but not necessarily into RAM.
 
 You could mmap the file from the distributed cache using MappedByteBuffer.
 Then there will be one copy shared between JVMs...
 
 -- Owen



Re: Distributed cache Design

2008-10-16 Thread Owen O'Malley


On Oct 16, 2008, at 3:09 PM, Bhupesh Bansal wrote:

Let's say I want to implement a DFS on my graph. I am not able to picture
implementing it with the graph in pieces without putting a depth bound on
it (3-4). Let's say we have 200M edges (4GB) to start with.


Start by watching the lecture on graph algorithms in map/reduce:

http://www.youtube.com/watch?v=BT-piFBP4fE

And see if that makes it clearer. If not, ask more questions. *smile*

-- Owen


Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Thanks Colin and Owen,

I will try some of the ideas here and report back.

Best
Bhupesh



On 10/16/08 4:05 PM, Colin Evans [EMAIL PROTECTED] wrote:

 The trick is to amortize your computation over the whole set.  So DFS
 for a single node will always be faster on an in-memory graph, but
 Hadoop is a good tool for computing all-pairs shortest paths in one shot
 if you re-frame the algorithm as a belief propagation and message
 passing algorithm.
 
 A lot of the time, the computation still explodes into n^2 or worse, so
 you need to use a binning or blocking algorithm, like the one described
 here:  http://www.youtube.com/watch?v=1ZDybXl212Q
 
 In the case of graphs, a blocking function would be to find overlapping
 strongly connected subgraphs where each subgraph fits in a reasonable
 amount of memory.  Then within each block, you do your computation and
 you pass a summary of that computation to adjacent blocks, which gets
 factored into the next computation.
 
 When we hooked up a Very Big Graph to our Hadoop cluster, we found that
 there were a lot of scaling problems, which went away when we started
 optimizing for streaming performance.
 
 -Colin
 
 
 
 Bhupesh Bansal wrote:
 Can you elaborate here?

 Let's say I want to implement a DFS on my graph. I am not able to picture
 implementing it with the graph in pieces without putting a depth bound on
 it (3-4). Let's say we have 200M edges (4GB) to start with.
 
 Best
 Bhupesh
 
 
 
 On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
 
   
 On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
 
 
 We at LinkedIn are trying to run some large graph analysis problems on
 Hadoop. The fastest way to run them would be to keep a copy of the whole
 graph in RAM at all mappers (the graph is about 8G in RAM). We have a
 cluster of 8-core machines with 8G on each.
   
 The best way to deal with it is *not* to load the entire graph in one
 process. In the WebMap at Yahoo, we have a graph of the web that has
 roughly 1 trillion links and 100 billion nodes. See
 http://tinyurl.com/4fgok6. To invert the links, you process the graph in
 pieces and re-sort based on the target. You'll get much better performance
 and scale to almost any size.
 
 
 What is the best way of doing that? Is there a way for multiple mappers
 on the same machine to access a RAM cache? I read about Hadoop's
 distributed cache; it looks like it copies the file (HDFS / HTTP) locally
 onto the slaves, but not necessarily into RAM.
   
 You could mmap the file from the distributed cache using MappedByteBuffer.
 Then there will be one copy shared between JVMs...
 
 -- Owen