[
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499789#comment-13499789
]
Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------
Hi Arun,
I would like to make the following points:
* We talked about the different kinds of processing that can happen before the {{Reducer}}.
Currently, we have a *merge*. It can be a *sort*, as you mentioned, or a simple
*copy* as well. The *copy* case arises when one wants to avoid the sorting that
happens in the MR data flow. It would enable hash-based aggregation or joins in
the {{Reducer}}.
* Regardless of the processing done, and whether the shuffle is push- or pull-based,
the processing, not the shuffle, should be in control of driving the data flow.
The distinction is not obvious for a *sort* or *merge*. For a *copy*, it makes a big
difference.
* For a *copy*, we want the {{Reducer}} to receive the <key, value> pairs as
soon as data is shuffled (unlike a *sort* or *merge*, which has to wait until the
last <key, value> pair is seen before outputting the first <key, value> pair).
There is no need to spill data to disk on the reduce side.
* With the current arrangement, where the shuffle assumes that the
processing (*merge*) can return a {{RawKeyValueIterator}} only at the end of
shuffling, it is impossible to support *copy*. There is an inherent deadlock
because *copy* wants to return the <key, value> pairs right away, whereas the shuffle
assumes that this can happen only at the end.
* The change I made is very simple. It does not alter any semantics, and it
allows the processing to be a *copy* without any deadlock. In fact, the test I
created as part of this Jira does a simple *copy* before the {{Reducer}}.
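To make the *copy* case concrete, here is a minimal sketch in plain Java. It is not the attached patch and does not use any Hadoop API: {{Pair}}, {{startShuffle}} and {{copyToReducer}} are hypothetical names, and the small bounded buffer is an assumption standing in for "no spill to disk on the reduce side". The simulated shuffle pushes pairs as they arrive, and the consumer (standing in for the {{Reducer}}) pulls them while the shuffle is still running.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamingCopySketch {
    /** Hypothetical stand-in for a <key, value> pair; not a Hadoop type. */
    static final class Pair {
        final String key;
        final String value;
        Pair(String key, String value) { this.key = key; this.value = value; }
    }

    /** Sentinel that marks the end of the shuffled data. */
    static final Pair END = new Pair(null, null);

    /** Simulated shuffle: pushes pairs into a bounded in-memory buffer as they arrive. */
    static Thread startShuffle(BlockingQueue<Pair> buffer, int count) {
        Thread t = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    buffer.put(new Pair("k" + i, "v" + i)); // blocks while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    /** Copy-style processing: pairs are consumed while the shuffle is still running. */
    static List<String> copyToReducer(int count, int bufferSize) throws InterruptedException {
        BlockingQueue<Pair> buffer = new ArrayBlockingQueue<>(bufferSize);
        Thread shuffle = startShuffle(buffer, count);
        List<String> seen = new ArrayList<>();
        // The consumer drives the flow: it takes each pair as soon as one is available.
        for (Pair p = buffer.take(); p != END; p = buffer.take()) {
            seen.add(p.key + "=" + p.value); // stands in for the Reducer's work
        }
        shuffle.join(); // waiting for the shuffle happens only after the data is drained
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(copyToReducer(5, 2)); // prints [k0=v0, k1=v1, k2=v2, k3=v3, k4=v4]
    }
}
```

In this sketch, moving {{shuffle.join()}} ahead of the drain loop models the arrangement where the iterator is handed over only at the end of shuffling: with 5 pairs and a buffer of 2, the shuffle thread blocks on a full buffer and the consumer never starts, which is one way to picture the deadlock described above.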
I hope this clarifies the reason for the change.
Thanks.
-- Asokan
> Allow external sorter plugin for MR
> -----------------------------------
>
> Key: MAPREDUCE-2454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
> Reporter: Mariappan Asokan
> Assignee: Mariappan Asokan
> Priority: Minor
> Labels: features, performance, plugin, sort
> Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf,
> KeyValueIterator.java, MapOutputSorterAbstract.java, MapOutputSorter.java,
> mapreduce-2454-modified-code.patch, mapreduce-2454-modified-test.patch,
> mapreduce-2454-new-test.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454-protection-change.patch,
> mr-2454-on-mr-279-build82.patch.gz, MR-2454-trunkPatchPreview.gz,
> ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to
> facilitate external sorter plugins both on the Map and Reduce sides.