Exactly, it should be done in 4 iterations instead of 4 * 100 iterations (a rough sketch of batching all the sources into one set of passes is attached after the quoted thread below). But in general, graph algorithms are harder to do in Map/Reduce, and the stateless nature of each job typically requires shuffling the whole graph in each iteration. Jimmy Lin has described a technique called Schimmy to avoid that, which I describe here: http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html
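To give a feel for the idea (a rough sketch only, not the code from the paper or from my post; the record layout and the partition file name below are invented), the reducer does a merge-join between the shuffled distance messages and its own sorted partition of the graph state read straight from HDFS, so the adjacency lists never go through the shuffle:

#!/usr/bin/env python
# Hypothetical Schimmy-style reducer for Hadoop Streaming (illustrative only).
# Assumption: the graph state lives in per-partition files such as
# state/part-00007, sorted in the same key order the shuffle uses, with lines
#   node<TAB>distance<TAB>neighbor1,neighbor2,...   (distance may be "INF")
# Mappers emit only small "node<TAB>candidate_distance" messages, so the
# adjacency lists are never shuffled.

import sys
from itertools import groupby

INF = float("inf")

def messages(stream):
    for line in stream:
        node, dist = line.rstrip("\n").split("\t")
        yield node, int(dist)

def state_records(path):
    for line in open(path):
        node, dist, adj = line.rstrip("\n").split("\t")
        yield node, (INF if dist == "INF" else int(dist)), adj

def emit(node, dist, adj):
    print(f"{node}\t{'INF' if dist == INF else dist}\t{adj}")

def reducer(msgs, state_path):
    state = state_records(state_path)
    cur = next(state, None)
    for node, group in groupby(msgs, key=lambda kv: kv[0]):
        # Merge-join: copy forward the state of nodes that received no message.
        while cur is not None and cur[0] < node:
            emit(*cur)
            cur = next(state, None)
        best = min(d for _, d in group)
        if cur is not None and cur[0] == node:
            emit(node, min(best, cur[1]), cur[2])   # keep the shorter distance
            cur = next(state, None)
        else:
            emit(node, best, "")  # message for a node absent from this partition
    while cur is not None:                          # flush the remaining state
        emit(*cur)
        cur = next(state, None)

if __name__ == "__main__":
    # A real implementation derives the state file from the reduce task id;
    # hard-coded here purely for illustration.
    reducer(messages(sys.stdin), "state/part-00007")

The point is simply that the per-iteration shuffle then carries only node-id/distance pairs instead of the whole adjacency structure.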
And of course, the BSP style is another good way to achieve this, which is what Google's Pregel model is based on. I also blog about this here: http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html

Notice that "scalability" and "speed" are different animals. Hadoop is about scalability, not speed. If your data can be squeezed into a single machine, then Hadoop is not for you. Otherwise, Hadoop is a great tool to parallelize your processing. It has a fairly constant (but significant) overhead that is justifiable when you have a large amount of data. For example, Hadoop is not for processing real-time streams: http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html

Notice that ... Hadoop is just one of the tools you can use, and you decide whether there is a better one for your problem.

Rgds, Ricky

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, December 21, 2010 7:29 AM
To: [email protected]
Subject: Re: breadth-first search

Ahh... I see what you mean.

This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400.

This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important.

2010/12/21 Peng, Wei <[email protected]>
> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 source
> nodes, so 400 iterations in total. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
>
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
>
> Wei
> -----Original Message-----
> From: Edward J. Yoon [mailto:[email protected]]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: [email protected]
> Subject: Re: breadth-first search
>
> There's no release yet.
>
> But, I have tested the BFS using Hama and HBase.
>
> Sent from my iPhone
>
> On 2010. 12. 21., at 11:30 AM, "Peng, Wei" <[email protected]> wrote:
>
> > Yoon,
> >
> > Can I use HAMA now, or is it still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -----Original Message-----
> > From: Edward J. Yoon [mailto:[email protected]]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: [email protected]
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[email protected]> wrote:
> >>
> >> I implemented an algorithm to run Hadoop on a 25GB graph dataset to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is a friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then, for
> >> each seed node, calculate the shortest paths from this node to all other
> >> nodes in the graph.
> >>
> >> To do this, I first run a simple Python script on a single machine to get
> >> some random seed nodes.
> >> Then I run a Hadoop job to generate the adjacency list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacency-list input and outputs the first-level
> >> breadth-first search result. The nodes which are friends of the seed
> >> node have distance 1. Then this output is the input for the next Hadoop
> >> job, and so on, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Normal Python code can find the separation length within 1 second (100
> >> seed nodes). However, Hadoop took almost 3 hours to do that
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> Hadoop? It is very inefficient to output so many intermediate results.
> >> In total there would be seedNodeNumber*levelNumber+1 jobs and
> >> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
> >>
> >> Please help. Thanks!
> >>
> >> Wei
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > [email protected]
> > http://blog.udanax.org
>
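To make the "4 steps, not 400" suggestion above concrete, here is a rough multi-source BFS sketch for Hadoop Streaming. None of this is code from the thread; the record format, the script name, and the invocation below are invented for illustration. The trick is to let every node carry a vector of tentative distances, one slot per seed, so a single map/reduce pass advances the frontier of all 100 seeds by one level at once.

#!/usr/bin/env python
# Hypothetical multi-source BFS step for Hadoop Streaming (illustrative only).
# State record, one line per node:
#   node<TAB>d1,d2,...,dK<TAB>neighbor1,neighbor2,...
# where di is the tentative distance from seed i ("INF" if not yet reached).
# The initial input has 0 in slot i for seed i and INF everywhere else.

import sys

INF = "INF"

def mapper():
    for line in sys.stdin:
        node, dists, adj = line.rstrip("\n").split("\t")
        # Re-emit the node's own state so its distances and adjacency survive.
        print(f"{node}\tR\t{dists}\t{adj}")
        slots = dists.split(",")
        if all(d == INF for d in slots):
            continue  # unreached by every seed: nothing to propagate yet
        proposed = ",".join(INF if d == INF else str(int(d) + 1) for d in slots)
        for nbr in (adj.split(",") if adj else []):
            # Propose distance+1 to each neighbor, for every seed that reached us.
            print(f"{nbr}\tP\t{proposed}\t")

def merge(a, b):
    # Element-wise minimum of two distance vectors (kept as strings).
    return [x if y == INF else (y if x == INF else str(min(int(x), int(y))))
            for x, y in zip(a, b)]

def reducer():
    cur, best, adj = None, None, ""

    def flush():
        if cur is not None:
            print(f"{cur}\t{','.join(best)}\t{adj}")

    for line in sys.stdin:                     # the shuffle sorts lines by node id
        parts = line.rstrip("\n").split("\t")
        node, kind, dists = parts[0], parts[1], parts[2]
        if node != cur:
            flush()
            cur, best, adj = node, dists.split(","), ""
        else:
            best = merge(best, dists.split(","))
        if kind == "R":
            adj = parts[3] if len(parts) > 3 else ""
    flush()

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

Each pass would be launched with the usual streaming jar, something along the lines of hadoop jar hadoop-streaming.jar -input step_k -output step_k+1 -mapper "bfs_step.py map" -reducer "bfs_step.py reduce" (exact flags depend on your Hadoop version, and the script must be shipped to the cluster). With a 4-level graph, about 4 such passes replace the 400 per-seed jobs, and a final small job or local script can average the finished distance vectors to get the separation length.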
