Re: Parallell maps

Ted Dunning Fri, 03 Jul 2009 18:48:53 -0700

On Fri, Jul 3, 2009 at 4:36 PM, Marcus Herou <[email protected]>wrote:


> I understand what you are saying but the theory do not really get into my
> head... You mean that the latency in the CPU + Disk-IO is something like
> 10000 times less (or perhaps more) than the latency between calling a
> remote
> system via sockets ? I can agree on that.
>

yes.

exactly.  By reading data sequentially, things move vastly faster.


> Please point out some code which uses MR so I can examine and test for
> myself or use the back your envelope and describe what I need to do make it
> happen.
>

Several of the posters in this thread have already done that.


> What system are you using to get the inlinks/outlinks from a node ? We map
> the matrix up beforehand using lucene and rsync it out on all machines.
> Every MR job then uses the same static index.
>

You have to include the time to convert your matrix and rsync it to all
machines to make a fair comparison.  Also, but distributing all data to all
nodes, you are converting a process which is nearly linear in the size of
your data into a process that is quadratic.  Other scaling factors get worse
as well.

Try moving data in flat files.  Trivial is best here.  Hadoop does the data
distribution and ensures that scaling works well.

One common format is to  put the node name at the beginning of a line and
follow with tab delimited linked nodes.  Another file has the node name and
page rank on a line.   The mapper generates an output record for each of the
linked nodes with a weight, the combiner sums weights and the reducer
produces a new page rank file.  All disk access is completely sequential,
each node only deals with a small part of the data and things work very very
well.

Re: Parallell maps

Reply via email to