I have a MapReduce (MR) application that runs fine except for its performance. Increasing the number of data nodes is not an option for me.
Looking at the source code of the MR framework, I noticed that the partitioned output of each mapper is sorted (MapTask.java), and that on the reduce side the partitions from the various mappers are merged (ReduceTask.java) before the reduce step runs. Functionally, the reducers in my application do not require the data to be in sorted order, so getting rid of the sort and merge steps in the framework would help my application.

Does anyone know why the framework sorts and merges the intermediate data? Is there anything (MR functional concepts, framework design, etc.) that requires the intermediate data to be sorted and merged? I want to try removing the sort and merge steps from the framework, and I would like to know of any potential risks involved.

Any input is appreciated.

Thanks,
Ravi
