The graph my BFS algorithm runs on needs only 4 levels to reach all nodes. The reason I say "many iterations" is that there are 100 source nodes, so 400 iterations in total. The algorithm should be correct, and I cannot think of a way to reduce the number of iterations.
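For concreteness, my driver is just a loop that submits one Hadoop job per (seed, level). Below is a rough sketch of its shape in Hadoop Streaming style, not my exact code; the paths and the mapper.py/reducer.py scripts are placeholders:

import subprocess

def run_level(inp, out):
    # Submit one BFS-level job via Hadoop Streaming. mapper.py and
    # reducer.py stand in for the scripts that expand the frontier.
    subprocess.check_call([
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", inp, "-output", out,
        "-mapper", "mapper.py", "-reducer", "reducer.py",
        "-file", "mapper.py", "-file", "reducer.py",
    ])

seeds = [s.strip() for s in open("seeds.txt") if s.strip()]
LEVELS = 4                              # my test graph needs 4 levels

for seed in seeds:
    frontier = "bfs/%s/level0" % seed   # adjacency list marked with the seed
    for level in range(1, LEVELS + 1):
        out = "bfs/%s/level%d" % (seed, level)
        run_level(frontier, out)        # each level's output feeds the next
        frontier = out

# 100 seeds x 4 levels = 400 jobs (plus the adjacency-list job). At about
# 3 hours total, that is roughly 27 seconds of startup overhead per job,
# which dwarfs the real work on a 100-node graph.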
Ted, I will check out the links you sent. I really appreciate your help.

Wei

-----Original Message-----
From: Edward J. Yoon [mailto:[email protected]]
Sent: Tuesday, December 21, 2010 1:27 AM
To: [email protected]
Subject: Re: breadth-first search

There's no release yet, but I have tested BFS using Hama and HBase.

Sent from my iPhone

On 2010. 12. 21., at 11:30 AM, "Peng, Wei" <[email protected]> wrote:

> Yoon,
>
> Can I use HAMA now, or is it still in development?
>
> Thanks
>
> Wei
>
> -----Original Message-----
> From: Edward J. Yoon [mailto:[email protected]]
> Sent: Monday, December 20, 2010 6:23 PM
> To: [email protected]
> Subject: Re: breadth-first search
>
> Check this slide out -
> http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
>
> On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[email protected]> wrote:
>>
>> I implemented an algorithm that runs Hadoop on 25 GB of graph data to
>> calculate its average separation length.
>> The input format is V1(tab)V2 (where V2 is a friend of V1).
>> My goal is to first randomly select some seed nodes, and then, for each
>> seed node, calculate the shortest paths from that node to all other
>> nodes in the graph.
>>
>> To do this, I first run a simple Python script on a single machine to
>> pick some random seed nodes.
>> Then I run a Hadoop job to generate the adjacency list for each node as
>> the input for the second job.
>>
>> The second job takes the adjacency-list input and outputs the first
>> level of the breadth-first search: the nodes that are friends of the
>> seed node get distance 1. That output is then the input for the next
>> Hadoop job, and so forth, until all nodes have been reached.
>>
>> I generated a simulated graph for testing. This data has only 100
>> nodes. Ordinary Python code can find the separation length within 1
>> second (100 seed nodes). However, Hadoop took almost 3 hours to do the
>> same thing (pseudo-distributed mode on one machine)!!
>>
>> I wonder if there is a more efficient way to do breadth-first search in
>> Hadoop? It is very inefficient to output so many intermediate results.
>> In total there would be seedNodeNumber*levelNumber+1 jobs and
>> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
>>
>> Please help. Thanks!
>>
>> Wei
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> [email protected]
> http://blog.udanax.org
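P.S. For anyone reading the archive: here is a minimal sketch of one frontier-expansion step as a Hadoop Streaming mapper/reducer pair. The record format node(tab)dist(tab)nbr1,nbr2,... (with dist = -1 for not-yet-reached nodes) is invented for illustration; it is not the exact format of my jobs.

#!/usr/bin/env python
# mapper.py -- expand the BFS frontier by one level.
# Input records: node<TAB>dist<TAB>nbr1,nbr2,...  (dist = -1 if unreached)
import sys

for line in sys.stdin:
    node, dist, nbrs = line.rstrip("\n").split("\t")
    d = int(dist)
    print("%s\t%s\t%s" % (node, dist, nbrs))      # pass the node through
    if d >= 0:                                    # reached: propose dist+1
        for nbr in nbrs.split(","):
            if nbr:
                print("%s\t%d\t" % (nbr, d + 1))  # candidate, no adj list

#!/usr/bin/env python
# reducer.py -- per node, keep the smallest proposed distance and the
# adjacency list (streaming sorts by key, so a node's records arrive
# grouped together).
import sys

def emit(node, dist, nbrs):
    print("%s\t%d\t%s" % (node, dist, nbrs))

cur, best, adj = None, -1, ""
for line in sys.stdin:
    node, dist, nbrs = line.rstrip("\n").split("\t")
    d = int(dist)
    if node != cur:
        if cur is not None:
            emit(cur, best, adj)
        cur, best, adj = node, d, nbrs
        continue
    if d >= 0 and (best < 0 or d < best):
        best = d                                  # smaller distance wins
    if nbrs:
        adj = nbrs                                # carry adjacency forward
if cur is not None:
    emit(cur, best, adj)

The reducer relies on streaming's sort-by-key to group each node's records, keeps the smallest distance proposed for it, and carries the adjacency list forward; this is the usual way to express one BFS level as a single MapReduce pass.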
