Thanks, yup our format looks something like that. Even more simple than our solution. I will try this out and compare.
Cheers /M On Sat, Jul 4, 2009 at 3:48 AM, Ted Dunning <[email protected]> wrote: > On Fri, Jul 3, 2009 at 4:36 PM, Marcus Herou <[email protected] > >wrote: > > > I understand what you are saying but the theory do not really get into my > > head... You mean that the latency in the CPU + Disk-IO is something like > > 10000 times less (or perhaps more) than the latency between calling a > > remote > > system via sockets ? I can agree on that. > > > > yes. > > exactly. By reading data sequentially, things move vastly faster. > > > > Please point out some code which uses MR so I can examine and test for > > myself or use the back your envelope and describe what I need to do make > it > > happen. > > > > Several of the posters in this thread have already done that. > > > > What system are you using to get the inlinks/outlinks from a node ? We > map > > the matrix up beforehand using lucene and rsync it out on all machines. > > Every MR job then uses the same static index. > > > > You have to include the time to convert your matrix and rsync it to all > machines to make a fair comparison. Also, but distributing all data to all > nodes, you are converting a process which is nearly linear in the size of > your data into a process that is quadratic. Other scaling factors get > worse > as well. > > Try moving data in flat files. Trivial is best here. Hadoop does the data > distribution and ensures that scaling works well. > > One common format is to put the node name at the beginning of a line and > follow with tab delimited linked nodes. Another file has the node name and > page rank on a line. The mapper generates an output record for each of > the > linked nodes with a weight, the combiner sums weights and the reducer > produces a new page rank file. All disk access is completely sequential, > each node only deals with a small part of the data and things work very > very > well. > -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 [email protected] http://www.tailsweep.com/
