Miles,

Thanks for your response.

I think I understand: basically, I would add a combiner class that computes
the partial results in phase 2, correct (just like in the word count
example)?
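
In case it helps pin down what I mean, here is roughly what I plan to try: an
untested sketch against the old mapred API, where SumReducer is just a
placeholder name for a word-count-style summer that I'd register as the
combiner (and reducer) of the second phase. I'm assuming Text/IntWritable
keys and values here, which may not match my real types.

-------------------
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Word-count-style summer, usable as both the combiner and the reducer of
// the second phase: it adds up the 1s (or partial sums) for each (i,j) key.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// In the phase-2 driver (conf is that job's JobConf):
//   conf.setCombinerClass(SumReducer.class);
//   conf.setReducerClass(SumReducer.class);
-------------------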

However, even if I do that, I don't think it gets rid of the overhead of
reading 48 GB from disk back into memory, which I suspect is a large part of
the problem. Would you agree that this is likely the largest source of
overhead?

Is there something in Hadoop that will let me avoid writing the output of the
first reduce phase to disk, and instead feed it directly as input to the
second reduce?
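
In the meantime, here is my rough reading of your in-memory suggestion,
applied at the first reduce rather than the map (in my setup the pairs only
appear after grouping by random number): accumulate the pair counts in a
HashMap across reduce() calls and emit the merged totals in close(), so the
individual <(i,j), 1> records never hit disk. The partial totals from the
different reduce tasks would still need one small word-count-style merge, and
this assumes the set of distinct pairs fits in memory. Completely untested;
class, type and key names are placeholders.

-------------------
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// First-phase reducer that counts each unordered pair in memory instead of
// emitting one <(i,j), 1> record per occurrence.
public class PairCountReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private OutputCollector<Text, IntWritable> out;  // remembered for close()

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    out = output;
    // gather the line numbers that share this random number
    List<String> lines = new ArrayList<String>();
    while (values.hasNext()) {
      lines.add(values.next().toString());
    }
    // sort so each unordered pair always produces the same key string
    Collections.sort(lines);
    for (int i = 0; i < lines.size(); i++) {
      for (int j = i + 1; j < lines.size(); j++) {
        String pair = "(" + lines.get(i) + "," + lines.get(j) + ")";
        Integer c = counts.get(pair);
        counts.put(pair, c == null ? 1 : c + 1);
      }
    }
  }

  public void close() throws IOException {
    if (out == null) {
      return;  // this reduce task saw no input
    }
    // emit each pair once, with its total for this reduce task
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      out.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
-------------------

If that works, the second phase would only have to merge per-task partial
sums instead of chewing through 48GB of single counts.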

Thanks,

-SM

On Fri, Sep 19, 2008 at 3:13 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:

> if each mapper only sees a relatively small chunk of the data, then
> why not have each one compute the counts of 2-perms in memory?
>
> you would then get the reducer to merge these partial results together.
>
> (details are left to the reader ...)
>
> Miles
>
> 2008/9/19 Sandy <[EMAIL PROTECTED]>:
> > Hi,
> >
> > I have a serious problem that I'm not sure how to fix. I have two M/R
> > phases that calculate a matrix in parallel. It works... but it's slower
> > than the serial version (by about 100 times).
> >
> > Here is a toy program that works similarly to my application. In this
> > example, different random numbers are generated per line of input, and
> > then an n x n matrix is built that counts how many of these random
> > numbers are shared.
> >
> > -------------------
> > first map phase() {
> > input: key = offset, value = line of text (embedded line number), ln
> > generate k random numbers, k1 .. kn
> > emit: <ki, ln >
> > }
> >
> > first reduce phase() {
> > input: key = ki, value = list(ln)
> > if list size is greater than one:
> >   for every 2-permutation p:
> >      emit: <p, 1>
> >    //example: list = 1 2 3
> >    //emit: <(1,2), 1>
> >    //emit: <(2,3), 1>
> >    //emit: <(1,3), 1>
> > }
> >
> > second map phase() {
> > input: key = offset, value = (i, j) 1
> > //dummy function. acts as a transition to reduce
> > parse value into two tokens [(i,j)] and [1]
> > emit: <(i,j), 1>
> > }
> >
> > second reduce() {
> > input: key = (i,j)  value = list(1)
> > //wordcount
> > sum up the list of ones
> > emit: <(i,j), sum(list(1))>
> > }
> > ------------------
> >
> > Now here's the problem:
> >
> > Let's suppose the file is 27MB.
> > The time it takes for the first map phase is about 3 minutes.
> > The time it takes for the first reduce phase is about 1 hour.
> > The size of the intermediate files produced by this first M/R phase is
> > 48GB.
> >
> > The time it takes for the second map phase is 9 hours (and this function
> > is just a dummy function!).
> > The time it takes for the second reduce phase is 12 hours.
> >
> > I have been trying to change the number of map and reduce tasks, but that
> > doesn't seem to really chip away at the massive number of 2-permutations
> > that need to be handled in the second M/R phase, at least not on my
> > current machine.
> >
> >
> > Has anyone implemented a matrix computation in parallel using MapReduce?
> > If so, is this normal or expected behavior? I do realize that I have a
> > small input file, and that this may limit speedup. The most powerful
> > machine I have to run this M/R implementation is a Mac Pro with two
> > processors, each with four cores, and four hard disks of 1 TB each.
> >
> > Does anyone have any suggestions on what I can change (either with the
> > approach or the cluster setup -- do I need more machines?) in order to
> > make this faster? I am currently running 8 map tasks and 4 reduce tasks.
> > I am going to change that to 10 map tasks and 9 reduce tasks to see if it
> > helps, but I'm seriously wondering whether this will make much of a
> > difference since I only have one machine.
> >
> >
> > Any insight is greatly appreciated.
> >
> > Thanks!
> >
> > -SM
> >
>
>
>
