I'm a relative newbie to Hadoop, but the assumption below does not hold in my 
organization: it is common for us to call output.collect() more than once in a 
map() function.
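A minimal, self-contained sketch of that pattern (the class name and the tiny collector interface here are simplified stand-ins, not the real org.apache.hadoop.mapred API): a word-count-style map() that calls collect() once per token, so a single input record can produce many output records.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiCollectSketch {
    // Simplified stand-in with the same shape as Hadoop's
    // OutputCollector<K, V>; not the actual Hadoop interface.
    interface OutputCollector<K, V> {
        void collect(K key, V value);
    }

    // A word-count-style map(): one input line yields one output
    // record per token, so collect() runs many times per call.
    static void map(long offset, String line,
                    OutputCollector<String, Integer> output) {
        for (String token : line.trim().split("\\s+")) {
            output.collect(token, 1); // once per token, not once per line
        }
    }

    public static void main(String[] args) {
        List<String> collected = new ArrayList<>();
        OutputCollector<String, Integer> output =
            (k, v) -> collected.add(k + "\t" + v);
        map(0L, "to be or not to be", output);
        // One input record produced six output records.
        System.out.println(collected.size());
        System.out.println(collected.get(0));
    }
}
```

Because nothing bounds the number of collect() calls per input record, a job like this can easily emit more bytes than it reads.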

Dave Shine


-----Original Message-----
From: elton sky [mailto:eltonsky9...@gmail.com]
Sent: Tuesday, May 03, 2011 4:49 AM
To: common-dev@hadoop.apache.org
Subject: Re: Why mergeParts() is not parallel with collect() on map?

Please correct me if I am wrong. One of the important assumptions of Hadoop
MapReduce is that a map's output should be smaller than its input, so the
workload on the reduce side should be smaller than in the map phase. That's why
sort, spill, and merge are all done on the map side; reduce just merges the
sorted outputs.


> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.

In some cases, where the map's output is bigger than its input, there can be
many spill files to merge.
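For context, the number of spill files and the merge fan-in are driven by configuration. A sketch of the relevant properties (these are the pre-MRv2 names; values shown are illustrative, not recommendations):

```xml
<!-- mapred-site.xml fragment -->
<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <!-- Size of the in-memory sort buffer; each time it fills,
       the map task writes another spill file to disk. -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>10</value>
  <!-- How many spill files are merged at once; with more spills
       than this, mergeParts() needs multiple merge passes. -->
</property>
```

So a map that emits more than it reads, with a small sort buffer, can accumulate many spill files and a multi-pass merge at the end of the task.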


On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <a...@yahoo-inc.com> wrote:

> Elton,
>
>
> On May 2, 2011, at 11:30 PM, elton sky wrote:
>
>  In the shuffle phase, the reduce copies output from the maps. In parallel,
>> the InMemoryMerger and OnDiskMerger merge the copied files if there are too
>> many. But on the map side, mergeParts() happens only after collect() has
>> finished. Why don't we run spill merging in parallel with
>> collect()/sort-and-spill on the map side?
>>
>
> Certainly feasible, please feel free to open a jira for the enhancement.
>
> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.
>
> Arun
>
>
>

