Thanks. Yup, our format looks something like that.

This is even simpler than our solution. I will try it out and compare.

Cheers

/M

On Sat, Jul 4, 2009 at 3:48 AM, Ted Dunning <[email protected]> wrote:

> On Fri, Jul 3, 2009 at 4:36 PM, Marcus Herou <[email protected]> wrote:
>
> > I understand what you are saying, but the theory does not really get into
> > my head... You mean that the latency of CPU + disk I/O is something like
> > 10000 times less (or perhaps more) than the latency of calling a remote
> > system via sockets? I can agree on that.
> >
>
> yes.
>
> exactly.  By reading data sequentially, things move vastly faster.
>
>
> > Please point out some code which uses MR so I can examine and test it for
> > myself, or use the back of your envelope and describe what I need to do
> > to make it happen.
> >
>
> Several of the posters in this thread have already done that.
>
>
> > What system are you using to get the inlinks/outlinks from a node? We map
> > the matrix up beforehand using Lucene and rsync it out to all machines.
> > Every MR job then uses the same static index.
> >
>
> You have to include the time to convert your matrix and rsync it to all
> machines to make a fair comparison.  Also, by distributing all data to all
> nodes, you are converting a process which is nearly linear in the size of
> your data into a process that is quadratic.  Other scaling factors get
> worse as well.
>
> Try moving data in flat files.  Trivial is best here.  Hadoop does the data
> distribution and ensures that scaling works well.
>
> One common format is to put the node name at the beginning of a line and
> follow it with tab-delimited linked nodes.  Another file has the node name
> and page rank on a line.  The mapper generates an output record for each of
> the linked nodes with a weight, the combiner sums weights, and the reducer
> produces a new page rank file.  All disk access is completely sequential,
> each node only deals with a small part of the data, and things work very,
> very well.
>

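The flat-file scheme described in the quoted message above can be sketched in plain Python. This is only an illustration under stated assumptions, not code from this thread: the function names (parse_links, parse_ranks, map_node, sum_weights, pagerank_step), the damping factor of 0.85, and the in-memory dictionary of ranks are all hypothetical choices. A real Hadoop job would join the two files and run the mapper, combiner and reducer across machines rather than in one process.

```python
from collections import defaultdict


def parse_links(lines):
    """Parse 'node<TAB>target1<TAB>target2...' lines into {node: [targets]}."""
    links = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            links[fields[0]] = fields[1:]
    return links


def parse_ranks(lines):
    """Parse 'node<TAB>rank' lines into {node: rank}."""
    ranks = {}
    for line in lines:
        node, rank = line.rstrip("\n").split("\t")
        ranks[node] = float(rank)
    return ranks


def map_node(node, targets, rank):
    """Mapper: emit one (target, weight) record per linked node.

    The node's rank is split evenly over its out-links; a node with no
    out-links emits nothing.
    """
    for target in targets:
        yield target, rank / len(targets)


def sum_weights(records):
    """Combiner and reducer share this logic: sum the weights per node."""
    totals = defaultdict(float)
    for node, weight in records:
        totals[node] += weight
    return totals


def pagerank_step(links, ranks, damping=0.85):
    """One iteration: map every node, sum the weights, build the new ranks.

    The damping factor is an assumption of this sketch; the thread does
    not mention one.
    """
    records = (
        record
        for node, targets in links.items()
        for record in map_node(node, targets, ranks.get(node, 0.0))
    )
    totals = sum_weights(records)
    n = len(ranks)
    return {
        node: (1.0 - damping) / n + damping * totals.get(node, 0.0)
        for node in ranks
    }


if __name__ == "__main__":
    # Hypothetical three-node graph: a links to b and c, b to c, c to a.
    links = parse_links(["a\tb\tc\n", "b\tc\n", "c\ta\n"])
    ranks = {node: 1.0 / 3.0 for node in ("a", "b", "c")}
    for _ in range(20):
        ranks = pagerank_step(links, ranks)
    for node in sorted(ranks):
        print(f"{node}\t{ranks[node]:.6f}")
```

Each pass reads both inputs front to back and writes a fresh rank file, which is what keeps the disk access sequential. Repeating pagerank_step until the ranks stop changing gives the iteration.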


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
