Combiner may reduce the total amount of data transferred across the network. Even if it does not, the I/O size would be larger, which lets the disks perform better. Yes, we do need to be concerned about the possible increase in latency, but we can either let the user control it or use indicators such as cluster load, the progress of other maps (local or remote), and the duration of map tasks to decide whether such a delay would be beneficial overall.

On Aug 17, 2009, at 5:35 AM, Amogh Vasekar wrote:

The same amount of data will have to be read and transferred over the network, whether it sits in a single file or in multiple files. If you merge to a single file, the shuffle and sort (S&S) phase actually can't start until all mappers have finished, as opposed to fetching the output of an individual mapper task, which can happen as soon as that task has finished.
Just my two cents.

Amogh

-----Original Message-----
From: Zheng Shao [mailto:[email protected]]
Sent: Monday, August 17, 2009 3:36 AM
To: [email protected]
Subject: RE: merging multiple mapper's outputs

Multiple mapper tasks.

I think Combiner is independent of this functionality. Combiner merges rows with the same key. It can work on a single mapper's output as well as on multiple mappers' outputs together.
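As a rough illustration of that point, here is a minimal sketch in plain Python (not Hadoop code). The `combine` function, the word-count style summing, and the sample records are all assumptions made up for this example. It applies the same combiner once to each mapper's output and once to the merged output of two mappers on the same node, then compares how many records would be left to shuffle.

```python
from collections import Counter

def combine(records):
    """Hypothetical combiner: sum the values of records that share a key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())

# Made-up outputs of two map tasks that ran on the same node.
mapper_a = [("apple", 1), ("pear", 1), ("apple", 1)]
mapper_b = [("apple", 1), ("plum", 1), ("pear", 1)]

# Combiner applied to each mapper's output separately.
per_mapper = [combine(mapper_a), combine(mapper_b)]
per_mapper_records = sum(len(out) for out in per_mapper)

# Same combiner applied to the merged output of both mappers.
node_level = combine(mapper_a + mapper_b)

print("records after per-mapper combine:", per_mapper_records)
print("records after node-level combine:", len(node_level))
```

How much the node-level pass saves depends entirely on how many keys the mappers have in common; with disjoint keys it would save nothing.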

Zheng
-----Original Message-----
From: Zhong Wang [mailto:[email protected]]
Sent: Sunday, August 16, 2009 8:42 AM
To: [email protected]
Subject: Re: merging multiple mapper's outputs

On Sun, Aug 16, 2009 at 10:00 AM, Zheng Shao<[email protected]> wrote:
Does Hadoop have the capability of merging the outputs of multiple mappers (on the same node) into a single one, to speed up the shuffling phase? Is there a JIRA where I can find more information about it?

Do you mean outputs from multiple mapper tasks or multiple mapper
functions? Could Combiner help?



--
Zhong Wang
