Check out this slide deck on Apache Hama's BSP computing model - http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[email protected]> wrote:
>
> I implemented an algorithm to run Hadoop on a 25GB graph data set to
> calculate its average separation length.
> The input format is V1(tab)V2 (where V2 is a friend of V1).
> My purpose is to first randomly select some seed nodes, and then for
> each seed node, calculate the shortest paths from that node to all other
> nodes in the graph.
>
> To do this, I first run a simple Python script on a single machine to
> pick some random seed nodes.
> Then I run a Hadoop job to generate the adjacency list for each node as
> the input for the second job.
>
> The second job takes the adjacency-list input and outputs the first-level
> breadth-first-search result: the nodes which are friends of the seed
> node get distance 1. That output is then the input for the next Hadoop
> job, and so on, until all the nodes are reached.
>
> I generated a simulated graph for testing. This data has only 100 nodes.
> Plain Python code can find the separation length within 1 second (100
> seed nodes). However, Hadoop took almost 3 hours to do the same
> (pseudo-distributed mode on one machine)!!
>
> I wonder if there is a more efficient way to do breadth-first search in
> Hadoop? It is very inefficient to output so many intermediate results.
> In total there would be seedNodeNumber*levelNumber+1 jobs and
> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
>
> Please help. Thanks!
>
> Wei
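For what it's worth, here is roughly what the pipeline above looks like as Hadoop Streaming scripts. This is a minimal sketch in Python, not Wei's actual code; the file names and the record layout are my assumptions. The first job folds the raw V1(tab)V2 edge list into one adjacency record per node:

#!/usr/bin/env python
# adj_mapper.py -- identity mapper: pass each edge "V1<TAB>V2" through,
# keyed on V1 so the shuffle groups all of a node's friends together.
import sys

for raw in sys.stdin:
    rec = raw.rstrip("\n")
    if rec:
        v1, v2 = rec.split("\t")
        print("%s\t%s" % (v1, v2))

#!/usr/bin/env python
# adj_reducer.py -- input arrives sorted by node; collapse each node's
# friends into "node<TAB>friend1,friend2,...<TAB>-1". The trailing -1
# is the not-yet-reached distance used by the BFS jobs; the driver
# flips the chosen seed node's distance to 0 before the first level.
import sys

current, friends = None, []

def emit():
    if current is not None:
        print("%s\t%s\t%d" % (current, ",".join(friends), -1))

for raw in sys.stdin:
    node, friend = raw.rstrip("\n").split("\t")
    if node != current:
        emit()
        current, friends = node, []
    friends.append(friend)
emit()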
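Each BFS level is then one more Streaming job over those "node(tab)friends(tab)dist" records. The mapper re-emits every node's own record and offers dist+1 to the friends of already-reached nodes; the reducer keeps the adjacency list and the minimum distance seen:

#!/usr/bin/env python
# bfs_mapper.py -- one BFS level over "node<TAB>friends<TAB>dist"
# records, where dist = -1 means the node is still unreached.
import sys

for raw in sys.stdin:
    rec = raw.rstrip("\n")
    node, friends, dist = rec.split("\t")
    print(rec)                # always re-emit the node's own record
    d = int(dist)
    if d >= 0 and friends:    # reached node: offer d+1 to each friend
        for f in friends.split(","):
            print("%s\t\t%d" % (f, d + 1))

#!/usr/bin/env python
# bfs_reducer.py -- merge each node's records: keep its adjacency list
# and the smallest distance offered; -1 means still unreached.
import sys

current, friends, best = None, "", -1

def emit():
    if current is not None:
        print("%s\t%s\t%d" % (current, friends, best))

for raw in sys.stdin:
    node, f, dist = raw.rstrip("\n").split("\t")
    if node != current:
        emit()
        current, friends, best = node, "", -1
    if f:
        friends = f
    d = int(dist)
    if d >= 0 and (best < 0 or d < best):
        best = d
emit()

A driver reruns that pair (hadoop jar hadoop-streaming.jar -input ... -output ... -mapper bfs_mapper.py -reducer bfs_reducer.py), feeding each output back in as input until no dist changes. That is exactly where the seedNodeNumber*levelNumber cost comes from: every level pays full job-startup overhead and rewrites the whole graph to HDFS, even on a 100-node graph. The BSP model in the slides above avoids this, because workers stay alive across supersteps and exchange messages instead of materializing the entire graph at every level.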
-- 
Best Regards, Edward J. Yoon
[email protected]
http://blog.udanax.org
