I'm a relative newbie to Hadoop, but the assumption below does not hold in my organization. It is common for us to call output.collect() more than once in a map() function, so a mapper can emit more records than it reads.
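To make the point concrete, here is a minimal sketch of a word-count style map() that calls collect() once per token. It runs without Hadoop: the `Collector` interface and the `MultiEmitSketch` class are hypothetical stand-ins for illustration, not the real `OutputCollector` or `Mapper` API. Whether the output is also larger in bytes depends on the job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's OutputCollector; not the real API.
interface Collector<K, V> {
    void collect(K key, V value);
}

class MultiEmitSketch {
    // One input line yields one output record per token, so collect()
    // is called several times within a single map() invocation.
    static void map(long offset, String line, Collector<String, Integer> output) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                output.collect(token, 1);
            }
        }
    }

    // Runs map() over every line and returns all collected records.
    static List<String> run(List<String> lines) {
        List<String> emitted = new ArrayList<>();
        Collector<String, Integer> output = (k, v) -> emitted.add(k + "\t" + v);
        long offset = 0;
        for (String line : lines) {
            map(offset, line, output);
            offset += line.length() + 1; // assumed one-byte line terminator
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "that is the question");
        List<String> emitted = run(lines);
        // prints "2 input records -> 10 output records"
        System.out.println(lines.size() + " input records -> "
                + emitted.size() + " output records");
    }
}
```

With more records leaving the map than entering it, the sort buffer fills sooner and more spill files are written, which is the case where the map-side merge matters.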
Dave Shine

-----Original Message-----
From: elton sky [mailto:eltonsky9...@gmail.com]
Sent: Tuesday, May 03, 2011 4:49 AM
To: common-dev@hadoop.apache.org
Subject: Re: Why mergeParts() is not parallel with collect() on map?

Pls correct me if I am wrong.

One of the important assumptions of hadoop map reduce is: map's output should be smaller than input. So the workload on reduce should be smaller than map phase. That's why we put sort, spill and merge all on map side. Reduce just merge sorted output.

> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.

In some cases, if the output of map is bigger than input, there might be many spill files to be merged.

On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <a...@yahoo-inc.com> wrote:

> Elton,
>
> On May 2, 2011, at 11:30 PM, elton sky wrote:
>
>> In shuffle phase, reduce copies output from map. In parallel, there are
>> InMemoryMerger and OnDiskMerger merge copied files if too many. But on
>> map, the mergeParts() happens only after collect() finished. Why don't we
>> parallel spills merging with collect()/sort&spill on map?
>
> Certainly feasible, please feel free to open a jira for the enhancement.
>
> However, typically, the map's merge is much less intensive than the
> reduce's merge. As a result, this might just bloat the code for little gain,
> except in the most extreme cases.
>
> Arun