Thanks Alex and Ken.
My application does not do aggregation. It mainly does some data cleansing and transformation. So I don't need a combiner. (Also, I don't see why a combiner always needs sorted input; it should be optional and user specified) To take advantage of some optimizations, I need a partitioner - this means most of my application logic is in the reducers and cannot set the # of reducers to 0. Of course, I will be happy if there is a way to set the # of mappers to 0. Ravi From: Ken Goodhope [mailto:[email protected]] Sent: Monday, July 26, 2010 8:00 PM To: [email protected] Subject: Re: Why does the MR framework sorts the mapper output? The combiner needs sorted input. On Mon, Jul 26, 2010 at 1:46 PM, Alex Kozlov <[email protected]> wrote: Hi Ravi, Whether a sort is required is still a point of debate: the primary reason is to collect the entries with the same key, but one can implement MapReduce with hash deduping. The performance advantages/disadvantages are still a subject of debate. If you don't need sorting, you can always implement map-side aggregation though and potentially set the # of reducers to 0. There is no potential risk, but if you want to aggregate results across different mappers you'll get back to the original problem. Alex K On Mon, Jul 26, 2010 at 1:32 PM, Chinni, Ravi <[email protected]> wrote: I have an MR application that is running fine except for the performance. Increasing the number of data nodes is not an option to me. Looking at the source code of MR framework, I noticed that the partitioned output of each mapper is sorted (MapTask.java), and on the reduce side partitions from various mappers are merged (ReduceTask.java) before running the reduce step. Functionally, reducers in my application does not require data to be in sorted order and getting rid of the sort and merge steps in the framework will help my application. Does anyone know, why the sort and merge of intermediate data is being done by the framework? Is there anything - MR functional concepts, framework design etc. - that will need the sort and merge of intermediate data? I want to give a shot in getting rid of the sort and merge steps in the framework and want to know of any potential risks involved. Any input is appreciated. Thanks, Ravi ________________________________________________________________________ _____ ATTENTION: The information contained in this message (including any files transmitted with this message) may contain proprietary, trade secret or other confidential and/or legally privileged information. Any pricing information contained in this message or in any files transmitted with this message is always confidential and cannot be shared with any third parties without prior written approval from Syncsort. This message is intended to be read only by the individual or entity to whom it is addressed or by their designee. If the reader of this message is not the intended recipient, you are on notice that any use, disclosure, copying or distribution of this message, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or Syncsort and destroy all copies of this message in your possession, custody or control.
