A combiner may reduce the total amount of data transferred across the
network. Even if it does not, the individual I/O requests would be
larger, which lets the disks perform better. Yes, we do need to be
concerned about the possible increase in latency, but we can either
let the user control it or use indicators such as cluster load, the
progress of other maps (local or remote), and the duration of map
tasks to decide whether such a delay would be beneficial overall.
On Aug 17, 2009, at 5:35 AM, Amogh Vasekar wrote:
The same amount of data will have to be read and transferred over the
network, whether it sits in a single file or in multiple files. If you
do merge into a single file, the shuffle and sort (S&S) phase actually
can't start until all mappers have finished, whereas the output of an
individual mapper task can be fetched as soon as that task has finished.
Just my two cents.
Amogh
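To put rough numbers on the point above about when fetching can start, here is a minimal Python sketch. It is a toy model, not Hadoop code: the function name `earliest_fetch_start` and the finish times are made up for illustration, and the time spent on the merge itself is ignored.

```python
def earliest_fetch_start(finish_times, merged):
    """Return the earliest time a reducer could begin fetching map output.

    finish_times: completion times (in seconds) of the map tasks on one node.
    merged: True if their outputs are merged into a single file first.
    """
    if not finish_times:
        raise ValueError("need at least one map task")
    # A merged file is complete only after the slowest map task finishes;
    # separate outputs become fetchable as soon as the first task finishes.
    return max(finish_times) if merged else min(finish_times)


finish_times = [40, 55, 120]  # hypothetical values, for illustration only
separate = earliest_fetch_start(finish_times, merged=False)
single_file = earliest_fetch_start(finish_times, merged=True)
print(separate, single_file)
```

With these assumed times, fetching could begin at 40 seconds with separate outputs, but only at 120 seconds if the three outputs were merged first.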
-----Original Message-----
From: Zheng Shao [mailto:[email protected]]
Sent: Monday, August 17, 2009 3:36 AM
To: [email protected]
Subject: RE: merging multiple mapper's outputs
Multiple mapper tasks.
I think the Combiner is independent of this functionality. A Combiner
merges rows with the same key. It can work on a single mapper's output
or on the outputs of multiple mappers together.
Zheng
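The idea that a combiner merges rows with the same key, whether within one mapper's output or across several, can be sketched in plain Python. This is only an illustration under assumptions: it is not the Hadoop API, the `combine` function and the sample records are hypothetical, and it assumes a word-count style job where values for a key are summed.

```python
from collections import defaultdict


def combine(records):
    """Sum the values for each key, as a word-count style combiner might."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical outputs of two map tasks running on the same node.
mapper_a = [("apple", 1), ("pear", 1), ("apple", 1)]
mapper_b = [("apple", 1), ("fig", 1)]

# Combining each mapper's output separately.
per_mapper = combine(mapper_a) + combine(mapper_b)

# Combining the outputs of both mappers together.
across_mappers = combine(mapper_a + mapper_b)

print(len(mapper_a) + len(mapper_b), len(per_mapper), len(across_mappers))
```

In this toy case the five original records shrink to four when each mapper is combined on its own, and to three when both outputs are combined together, because the key "apple" appears in both.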
-----Original Message-----
From: Zhong Wang [mailto:[email protected]]
Sent: Sunday, August 16, 2009 8:42 AM
To: [email protected]
Subject: Re: merging multiple mapper's outputs
On Sun, Aug 16, 2009 at 10:00 AM, Zheng Shao <[email protected]> wrote:
Does Hadoop have the capability of merging the outputs of multiple
mappers (on the same node) into a single one, to speed up the
shuffling phase? Is there a JIRA where I can find more information
about it?
Do you mean outputs from multiple mapper tasks or from multiple mapper
functions? Could a Combiner help?
--
Zhong Wang