[jira] [Commented] (MAPREDUCE-2454) Allow external sorter plugin for MR

Mariappan Asokan (JIRA) Sun, 18 Nov 2012 04:47:11 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499789#comment-13499789
 ]


Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------

Hi Arun,
  I would like to make the following points:

* We talked about the different kinds of processing that can happen before the 
{{Reducer}}.  Currently, we have a *merge*.  It can be a *sort*, as you 
mentioned, or a simple *copy* as well.  The *copy* case arises when one wants to 
avoid the sorting that happens in the MR data flow.  It would enable hash-based 
aggregation or joins in the {{Reducer}}.
* Regardless of the processing done, and whether the shuffle is push or pull 
based, the processing step, not the shuffle, should be in control of driving 
the data flow.  The difference is not noticeable for a *sort* or *merge*.  For a 
*copy*, it makes a big difference.
* For a *copy*, we want the {{Reducer}} to receive the <key, value> pairs as 
soon as the data is shuffled (unlike *sort* or *merge*, which have to wait until 
the last <key, value> pair is seen before outputting the first <key, value> 
pair).  There is no need to spill data to disk on the reduce side.
* With the current arrangement, where the shuffle assumes that the processing 
(*merge*) can return a {{RawKeyValueIterator}} only at the end of shuffling, it 
is impossible to support *copy*.  There is an inherent deadlock because *copy* 
wants to return the <key, value> pairs right away, whereas the shuffle assumes 
that this can happen only at the end.
* The change I made is very simple.  It does not alter any semantics, and it 
allows the processing to be a *copy* without any deadlock.  In fact, the test I 
created as part of this Jira does a simple *copy* before the {{Reducer}}.
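
To illustrate the *copy* case described in the points above, here is a minimal standalone sketch.  It is not the code from the patch and does not use any Hadoop classes: {{StreamingCopyDemo}}, {{CopyIterator}} and the {{EOF}} marker are hypothetical names, and a plain {{BlockingQueue}} stands in for the shuffle.  The idea it shows is that the iterator is handed to the consumer before shuffling finishes, so the consumer can do hash-based aggregation on each pair as soon as it arrives.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class StreamingCopyDemo {
    // Hypothetical end-of-stream marker (an assumption of this sketch, not a Hadoop API).
    private static final Map.Entry<String, Integer> EOF = new SimpleEntry<>(null, null);

    /** Iterator given to the consumer up front; it blocks until the next pair arrives. */
    static final class CopyIterator implements Iterator<Map.Entry<String, Integer>> {
        private final BlockingQueue<Map.Entry<String, Integer>> queue;
        private Map.Entry<String, Integer> next;

        CopyIterator(BlockingQueue<Map.Entry<String, Integer>> queue) {
            this.queue = queue;
        }

        @Override
        public boolean hasNext() {
            if (next == null) {
                try {
                    next = queue.take(); // wait for the shuffle side to deliver a pair
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
            return next != EOF; // EOF stays cached, so repeated calls keep returning false
        }

        @Override
        public Map.Entry<String, Integer> next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Map.Entry<String, Integer> current = next;
            next = null;
            return current;
        }
    }

    /** Runs a stand-in shuffle thread and aggregates pairs as they stream in. */
    static Map<String, Integer> run() {
        BlockingQueue<Map.Entry<String, Integer>> queue = new LinkedBlockingQueue<>();

        // Stand-in for the shuffle: delivers unsorted pairs, then signals the end.
        Thread shuffle = new Thread(() -> {
            String[] keys = {"b", "a", "b", "c", "a"};
            for (int i = 0; i < keys.length; i++) {
                queue.add(new SimpleEntry<>(keys[i], i + 1));
            }
            queue.add(EOF);
        });
        shuffle.start();

        // Stand-in for the reducer: hash-based aggregation, no sort, no spill.
        Map<String, Integer> sums = new HashMap<>();
        Iterator<Map.Entry<String, Integer>> it = new CopyIterator(queue);
        while (it.hasNext()) {
            Map.Entry<String, Integer> kv = it.next();
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }

        try {
            shuffle.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

If the consumer could obtain the iterator only after the producer thread had finished, a bounded queue would fill up and both sides would wait on each other, which is the kind of deadlock described above.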

I hope this clarifies the reason for the change.

Thanks.

-- Asokan

                
> Allow external sorter plugin for MR
> -----------------------------------
>
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
>            Reporter: Mariappan Asokan
>            Assignee: Mariappan Asokan
>            Priority: Minor
>              Labels: features, performance, plugin, sort
>         Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf, 
> KeyValueIterator.java, MapOutputSorterAbstract.java, MapOutputSorter.java, 
> mapreduce-2454-modified-code.patch, mapreduce-2454-modified-test.patch, 
> mapreduce-2454-new-test.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454-protection-change.patch, 
> mr-2454-on-mr-279-build82.patch.gz, MR-2454-trunkPatchPreview.gz, 
> ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to 
> facilitate external sorter plugins both on the Map and Reduce sides.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
