On Fri, Jul 3, 2009 at 4:36 PM, Marcus Herou <[email protected]>wrote:
> I understand what you are saying but the theory do not really get into my > head... You mean that the latency in the CPU + Disk-IO is something like > 10000 times less (or perhaps more) than the latency between calling a > remote > system via sockets ? I can agree on that. > yes. exactly. By reading data sequentially, things move vastly faster. > Please point out some code which uses MR so I can examine and test for > myself or use the back your envelope and describe what I need to do make it > happen. > Several of the posters in this thread have already done that. > What system are you using to get the inlinks/outlinks from a node ? We map > the matrix up beforehand using lucene and rsync it out on all machines. > Every MR job then uses the same static index. > You have to include the time to convert your matrix and rsync it to all machines to make a fair comparison. Also, but distributing all data to all nodes, you are converting a process which is nearly linear in the size of your data into a process that is quadratic. Other scaling factors get worse as well. Try moving data in flat files. Trivial is best here. Hadoop does the data distribution and ensures that scaling works well. One common format is to put the node name at the beginning of a line and follow with tab delimited linked nodes. Another file has the node name and page rank on a line. The mapper generates an output record for each of the linked nodes with a weight, the combiner sums weights and the reducer produces a new page rank file. All disk access is completely sequential, each node only deals with a small part of the data and things work very very well.
