Check out this slide deck on Apache Hama's BSP computing model - http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[email protected]> wrote:
>
> I implemented an algorithm to run Hadoop on a 25GB graph data set to
> calculate its average separation length.
> The input format is V1(tab)V2 (where V2 is a friend of V1).
> My purpose is to first randomly select some seed nodes, and then for
> each seed node, calculate the shortest paths from that node to all other
> nodes in the graph.
>
> To do this, I first run a simple Python script on a single machine to
> pick some random seed nodes.
> Then I run a Hadoop job to generate the adjacency list for each node as
> the input for the second job.
>
> The second job takes the adjacency-list input and outputs the first-level
> breadth-first-search result: the nodes which are friends of the seed
> node get distance 1. That output is then the input for the next Hadoop
> job, and so on, until all the nodes are reached.
>
> I generated a simulated graph for testing. This data has only 100 nodes.
> Plain Python code can find the separation length within 1 second (100
> seed nodes). However, Hadoop took almost 3 hours to do the same
> (pseudo-distributed mode on one machine)!!
>
> I wonder if there is a more efficient way to do breadth-first search in
> Hadoop? It is very inefficient to output so many intermediate results.
> In total there would be seedNodeNumber*levelNumber+1 jobs and
> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
>
> Please help. Thanks!
>
> Wei
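For what it's worth, here is roughly what the pipeline above looks like as Hadoop Streaming scripts. This is a minimal sketch in Python, not Wei's actual code; the file names and the record layout are my assumptions. The first job folds the raw V1(tab)V2 edge list into one adjacency record per node:

#!/usr/bin/env python
# adj_mapper.py -- identity mapper: pass each edge "V1<TAB>V2" through,
# keyed on V1 so the shuffle groups all of a node's friends together.
import sys

for raw in sys.stdin:
    rec = raw.rstrip("\n")
    if rec:
        v1, v2 = rec.split("\t")
        print("%s\t%s" % (v1, v2))

#!/usr/bin/env python
# adj_reducer.py -- input arrives sorted by node; collapse each node's
# friends into "node<TAB>friend1,friend2,...<TAB>-1". The trailing -1
# is the not-yet-reached distance used by the BFS jobs; the driver
# flips the chosen seed node's distance to 0 before the first level.
import sys

current, friends = None, []

def emit():
    if current is not None:
        print("%s\t%s\t%d" % (current, ",".join(friends), -1))

for raw in sys.stdin:
    node, friend = raw.rstrip("\n").split("\t")
    if node != current:
        emit()
        current, friends = node, []
    friends.append(friend)
emit()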
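Each BFS level is then one more Streaming job over those "node(tab)friends(tab)dist" records. The mapper re-emits every node's own record and offers dist+1 to the friends of already-reached nodes; the reducer keeps the adjacency list and the minimum distance seen:

#!/usr/bin/env python
# bfs_mapper.py -- one BFS level over "node<TAB>friends<TAB>dist"
# records, where dist = -1 means the node is still unreached.
import sys

for raw in sys.stdin:
    rec = raw.rstrip("\n")
    node, friends, dist = rec.split("\t")
    print(rec)                # always re-emit the node's own record
    d = int(dist)
    if d >= 0 and friends:    # reached node: offer d+1 to each friend
        for f in friends.split(","):
            print("%s\t\t%d" % (f, d + 1))

#!/usr/bin/env python
# bfs_reducer.py -- merge each node's records: keep its adjacency list
# and the smallest distance offered; -1 means still unreached.
import sys

current, friends, best = None, "", -1

def emit():
    if current is not None:
        print("%s\t%s\t%d" % (current, friends, best))

for raw in sys.stdin:
    node, f, dist = raw.rstrip("\n").split("\t")
    if node != current:
        emit()
        current, friends, best = node, "", -1
    if f:
        friends = f
    d = int(dist)
    if d >= 0 and (best < 0 or d < best):
        best = d
emit()

A driver reruns that pair (hadoop jar hadoop-streaming.jar -input ... -output ... -mapper bfs_mapper.py -reducer bfs_reducer.py), feeding each output back in as input until no dist changes. That is exactly where the seedNodeNumber*levelNumber cost comes from: every level pays full job-startup overhead and rewrites the whole graph to HDFS, even on a 100-node graph. The BSP model in the slides above avoids this, because workers stay alive across supersteps and exchange messages instead of materializing the entire graph at every level.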
-- 
Best Regards, Edward J. Yoon
[email protected]
http://blog.udanax.org
