Why does the MR framework sort the mapper output?

Chinni, Ravi Mon, 26 Jul 2010 13:33:38 -0700

I have an MR application that runs fine except for its performance.
Increasing the number of data nodes is not an option for me.




Looking at the source code of the MR framework, I noticed that the
partitioned output of each mapper is sorted (MapTask.java), and that on
the reduce side the partitions from the various mappers are merged
(ReduceTask.java) before the reduce step runs. Functionally, the reducers
in my application do not require the data to be in sorted order, so
getting rid of the sort and merge steps in the framework would help my
application.
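
To make the two steps concrete, here is a minimal Python sketch of the flow
as described above. It is a simplified model under my own assumptions, not
Hadoop code: the names partition_of, map_side, reduce_side and run_job are
hypothetical, and everything is held in memory. In this model, merging the
sorted runs puts equal keys next to each other, which is what lets the
reduce side group values per key in a single streaming pass.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partition_of(key, num_partitions):
    # Hypothetical deterministic partitioner (stand-in for a hash partitioner).
    return sum(key.encode("utf-8")) % num_partitions


def map_side(records, num_partitions):
    """Split one mapper's (key, value) pairs into partitions, then sort each by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def reduce_side(sorted_runs, reduce_fn):
    """Merge the sorted runs for one partition and call reduce_fn once per key."""
    # Because every run is already sorted, a k-way merge yields one sorted stream.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    # groupby only groups *adjacent* equal keys, so it depends on the merged order.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


def run_job(mapper_inputs, num_partitions, reduce_fn):
    """Run every mapper, then one reduce per partition over the matching runs."""
    mapper_outputs = [map_side(records, num_partitions) for records in mapper_inputs]
    return [
        reduce_side([out[p] for out in mapper_outputs], reduce_fn)
        for p in range(num_partitions)
    ]


if __name__ == "__main__":
    inputs = [
        [("b", 1), ("a", 2), ("b", 3)],  # records seen by mapper 1
        [("a", 4), ("c", 5)],            # records seen by mapper 2
    ]
    print(run_job(inputs, 2, lambda key, values: (key, sum(values))))
```

If the runs were not sorted, the groupby step in this sketch would see the
same key in several separate groups, so the grouping would have to be done
some other way (for example with an in-memory table keyed by the map output
key).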



Does anyone know why the framework sorts and merges the intermediate
data? Is there anything (MR functional concepts, framework design, etc.)
that requires the sort and merge of intermediate data? I want to try
removing the sort and merge steps from the framework, and I would like to
know about any potential risks involved.



Any input is appreciated.



Thanks,

Ravi





