Thanks Ricky.
I do not think Dijkstra is more efficient than BFS: finding the node
with the minimum distance is costly, and we do not need to do that when
there are no edge weights. BFS is the special case of Dijkstra in which
all edge weights are equal.
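To illustrate the point: with unit edge weights, Dijkstra's priority queue pops nodes in exactly level order, so a plain FIFO queue gives the same distances without the overhead of extracting the minimum. A minimal in-memory sketch (the adjacency dict and function name are just for illustration, not the MapReduce version):

```python
from collections import deque

def bfs_distances(adj, seed):
    """Unweighted shortest-path distances from seed; equivalent to
    Dijkstra when every edge has weight 1."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit is the shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_distances(adj, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```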
In your algorithm, you are saving everything in memory. That is
unrealistic for a large graph.
My question is really about the efficient way to do graph computation,
matrix computation, and algorithms that need many iterations (with
intermediate results) to converge.
HAMA looks like a very good solution, but can we use it now, and how?

Wei

-----Original Message-----
From: Ricky Ho [mailto:[email protected]] 
Sent: Monday, December 20, 2010 7:01 PM
To: [email protected]
Subject: RE: breadth-first search


 I also blog about how to do Single Source Shortest Path here:
http://horicky.blogspot.com/2010/02/nosql-graphdb.html

One MR algorithm is based on Dijkstra and the other is based on BFS. I
think the first one is more efficient than the second one.

Rgds,
Ricky


-----Original Message-----
From: Peng, Wei [mailto:[email protected]] 
Sent: Monday, December 20, 2010 5:50 PM
To: [email protected]
Subject: breadth-first search 
 
 
 I implemented an algorithm that runs Hadoop on a 25GB graph to
calculate its average separation length.
The input format is V1(tab)V2 (where V2 is a friend of V1).
My purpose is to first randomly select some seed nodes, and then, for
each seed node, calculate the shortest paths from that node to all other
nodes in the graph.
 
To do this, I first run a simple Python script on a single machine to
get some random seed nodes.
Then I run a Hadoop job to generate the adjacency list for each node as
the input for the second job.
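For reference, the grouping that this first job performs can be sketched as a plain Python function (a local illustration of what the job computes from the V1(tab)V2 lines, not the Hadoop code itself):

```python
from collections import defaultdict

def build_adjacency(edge_lines):
    """Group each node's friends into one adjacency-list record.
    Input lines have the form 'V1\tV2' (V2 is a friend of V1)."""
    adj = defaultdict(list)
    for line in edge_lines:
        v1, v2 = line.rstrip("\n").split("\t")
        adj[v1].append(v2)
    return dict(adj)

edges = ["a\tb", "a\tc", "b\tc"]
print(build_adjacency(edges))  # {'a': ['b', 'c'], 'b': ['c']}
```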
 
The second job takes the adjacency-list input and outputs the first
level of the breadth-first search: the nodes that are friends of the
seed node get distance 1. That output then becomes the input for the
next Hadoop job, and so on, until all nodes have been reached.
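One level of this iteration can be sketched as a map and a reduce step in plain Python (an illustration of the pattern, not Hadoop code; the record format (distance, adjacency list) and the function names are assumptions):

```python
from collections import defaultdict

def bfs_map(node, dist, neighbors):
    """Map step: re-emit the node's own record, and propose dist+1
    to every neighbor of a node reached in a prior level."""
    yield node, (dist, neighbors)
    if dist is not None:                 # node already reached
        for n in neighbors:
            yield n, (dist + 1, None)

def bfs_reduce(node, values):
    """Reduce step: keep the minimum proposed distance and the
    node's adjacency list."""
    dists = [d for d, _ in values if d is not None]
    neighbors = next((ns for _, ns in values if ns is not None), [])
    return node, (min(dists) if dists else None, neighbors)

# Simulate one iteration: seed "a" at distance 0, "b" and "c" unreached.
records = {"a": (0, ["b", "c"]), "b": (None, ["c"]), "c": (None, [])}
grouped = defaultdict(list)
for node, (dist, nbrs) in records.items():
    for key, value in bfs_map(node, dist, nbrs):
        grouped[key].append(value)
new = dict(bfs_reduce(k, vs) for k, vs in grouped.items())
print(new)  # {'a': (0, ['b', 'c']), 'b': (1, ['c']), 'c': (1, [])}
```

Each Hadoop job runs one such iteration over the whole record set, which is why the number of jobs grows with the number of BFS levels.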
 
I generated a simulated graph for testing. It has only 100 nodes.
Plain Python code can find the separation length within 1 second (100
seed nodes), but Hadoop took almost 3 hours to do the same
(pseudo-distributed mode on one machine)!
 
I wonder if there is a more efficient way to do breadth-first search in
Hadoop? It is very inefficient to output so many intermediate results.
In total there would be seedNodeNumber*levelNumber+1 jobs and
seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
 
Please help.  Thanks!
 
Wei
