for the word count example, you would have each mapper work out the
counts of all the words it sees, but in memory (e.g. keep a hash table
and increment a word's count each time you see it).  the merge phase
would then simply add together the counts emitted for the same word.
more concretely, in your case the mapper would emit <2-perm, count>
pairs and the reducer would simply sum them up.
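
a minimal sketch of that in-mapper counting, against the old
org.apache.hadoop.mapred API (class and variable names are just
illustrative, not from your code):

-------------------
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// counts words in memory and only emits one <word, partial count> pair
// per distinct word when the mapper is closed; the reducer then just
// sums the partial counts.  assumes the vocabulary of one split fits
// in memory.
public class InMapperWordCount extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private OutputCollector<Text, IntWritable> out;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    out = output;  // remember the collector so close() can flush
    for (String word : value.toString().split("\\s+")) {
      if (word.length() == 0) continue;
      Integer c = counts.get(word);
      counts.put(word, c == null ? 1 : c + 1);
    }
  }

  @Override
  public void close() throws IOException {
    if (out == null) return;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      out.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
-------------------

the same trick applies directly to your 2-perm pairs: key on "i,j"
instead of the word.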

continuing with word counting, the basic MR way of doing it maximises
disk usage and minimises memory usage.  what i imagine is that you can
have stages in between -- use more memory than the pure, stateless MR
approach, but gain in speed.
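
(one cheap middle ground -- and, i think, what you describe below --
is to also register your summing reducer as a combiner, so partial
sums happen on the map side and far less intermediate data gets
written and shuffled.  sketch only, SumReducer is a made-up name:)

-------------------
// old mapred API: reuse the summing reducer as a combiner
JobConf conf = new JobConf(MyJob.class);
conf.setCombinerClass(SumReducer.class);   // SumReducer is hypothetical
conf.setReducerClass(SumReducer.class);
-------------------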

returning to your example, you may not need to read in that 48 GB of
data if your counts are fairly dense (you see many instances of a
given 2-perm entry).  if they are sparse, then you are probably doomed.
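
(rough, made-up numbers to illustrate: if a typical 2-perm shows up
~100 times within one mapper's split, partial aggregation collapses
those 100 records into a single <2-perm, count> record, so the 48 GB
of intermediate data drops to something on the order of 0.5 GB.  if
most 2-perms occur only once, you save essentially nothing.)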

Miles

2008/9/19 Sandy <[EMAIL PROTECTED]>:
> Miles,
>
> Thanks for your response.
>
> I think I understand... basically, I'm adding a combiner class that computes
> the partial results in phase 2.  Correct (just like in the word count
> example)?
>
> However, even if I do that, I don't think it gets rid of the overhead of
> reading 48 GB from disk back into memory, which I think may be a large part
> of the problem. Would you agree that that may be the largest source of
> overhead?
>
> Is there something in Hadoop that will allow me to prevent the writing of
> the output of the first reduce phase to disk, and instead just store it and
> place it directly as input for the second reduce?
>
> Thanks,
>
> -SM
>
> On Fri, Sep 19, 2008 at 3:13 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
>
>> if each mapper only sees a relatively small chunk of the data, then
>> why not have each one compute the counts of 2-perms in memory.
>>
>> you would then get the reducer to merge these partial results together.
>>
>> (details are left to the reader ...)
>>
>> Miles
>>
>> 2008/9/19 Sandy <[EMAIL PROTECTED]>:
>> > Hi,
>> >
>> > I have a serious problem that I'm not sure how to fix. I have two M/R
>> > phases that calculate a matrix in parallel. It works... but it's slower
>> > than the serial version (by about 100 times).
>> >
>> > Here is a toy program that works similarly to my application. In this
>> > example, different random numbers are generated per line of input, and
>> > then an n x n matrix is created that counts how many of these random
>> > numbers were shared.
>> >
>> > -------------------
>> > first map phase() {
>> > input: key = offset, value = line of text (embedded line number), ln
>> > generate k random numbers, k1 .. kn
>> > emit: <ki, ln >
>> > }
>> >
>> > first reduce phase() {
>> > input: key = ki, value = list(ln)
>> > if list size is greater than one:
>> >   for every 2-permutation p:
>> >      emit: <p, 1>
>> >    //example: list = 1 2 3
>> >    //emit: <(1,2), 1>
>> >    //emit: <(2,3), 1>
>> >    //emit: <(1,3), 1>
>> > }
>> >
>> > second map phase() {
>> > input: key = offset, value = (i, j) 1
>> > //dummy function. acts as a transition to reduce
>> > parse value into two tokens [(i,j)] and [1]
>> > emit: <(i,j), 1>
>> > }
>> >
>> > second reduce() {
>> > input: key = (i,j)  value = list(1)
>> > //wordcount
>> > sum up the list of ones
>> > emit: <(i,j), sum(list(1))>
>> > }
>> > ------------------
>> >
>> > Now here's the problem:
>> >
>> > Let's suppose the file is 27MB.
>> > The time it takes for the first map phase is about 3 minutes.
>> > The time it takes for the first reduce phase is about 1 hour.
>> > The size of the intermediate files produced by this first M/R phase
>> > is 48GB.
>> >
>> > The time it takes for the second map phase is 9 hours (and this function
>> > is just a dummy function!!)
>> > The time it takes for the second reduce phase is 12 hours.
>> >
>> > I have been trying to change the number of maps and reduce tasks, but
>> > that doesn't seem to really chip away at the massive number of
>> > 2-permutations that need to be taken care of in the second M/R phase. At
>> > least not on my current machine.
>> >
>> >
>> > Has anyone implemented a matrix in parallel using MapReduce? If so, is
>> > this normal or expected behavior? I do realize that I have a small input
>> > file, and that this may impact speedup. The most powerful machine I have
>> > to run this M/R implementation is a MacPro that has two processors, each
>> > with four cores, and 4 different hard disks of 1 TB each.
>> >
>> > Does anyone have any suggestions on what I can change (either with the
>> > approach or the cluster setup -- do I need more machines?) in order to
>> > make this faster? I am currently running 8 map tasks and 4 reduce tasks.
>> > I am going to change it to 10 map tasks and 9 reduce tasks and see if
>> > that helps any, but I'm seriously wondering if this is not going to give
>> > me much of a change since I only have one machine.
>> >
>> >
>> > Any insight is greatly appreciated.
>> >
>> > Thanks!
>> >
>> > -SM
>> >
>>
>>
>>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
