[
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mariappan Asokan updated MAPREDUCE-2454:
----------------------------------------
Attachment: mapreduce-2454.patch
Hi Arun,
Thank you very much for setting aside some time to talk with me
during Strata 2012. Here is the list of items we discussed and how I followed
up in the new patch.
* With YARN, different MR data processing engines can co-exist in addition to
the sort/merge done after map and before reduce. Keeping this in mind, I am
calling the map-side sort plugin interface {{PostMapProcessor}}.
Similarly, the merge done on the reduce side will be abstracted as
{{PreReduceProcessor}}.
* The {{PostMapProcessor}} can simply extend the existing
{{MapOutputCollector}} with an {{initialize()}} method. The current
{{MapOutputBuffer}} in MapTask.java will serve as the default implementation
of this interface.
* On the reduce side, my suggestion is to define {{PreReduceProcessor}} based
on methods already available in the {{MergeManager}} class. With minimal
changes, this will allow {{MergeManager}} to implement {{PreReduceProcessor}}.
* There is a concern about exposing some APIs as public. Since the revised
patch is much smaller than the one submitted before (one fourth of the original
patch size), the chance of breaking anything is minimized. Also, I expect that
only a handful of developers will write plugins. I have marked all the exposed
APIs with annotations stating that they are not stable and that there is a
risk in using them. Plugin developers will have to keep up with changes in the
exposed APIs, so the core Hadoop developers need not worry about maintaining
backward compatibility.
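As a rough illustration of the map-side idea in the list above (this is not the code in the attached patch), the relationship between the collector and the plugin interface could be sketched as follows. The method signatures, the {{Record}} holder, and the {{InMemorySortProcessor}} class are simplified, hypothetical stand-ins; the real {{MapOutputCollector}} and {{MapOutputBuffer}} in Hadoop differ in detail, and the reduce-side {{PreReduceProcessor}} is omitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortPluginSketch {

    /** Simplified stand-in for the existing collector (signatures are assumptions). */
    interface MapOutputCollector<K, V> {
        void collect(K key, V value, int partition);
        void flush();
        void close();
    }

    /** Map-side plugin point: the collector plus an initialize() hook. */
    interface PostMapProcessor<K, V> extends MapOutputCollector<K, V> {
        void initialize(Comparator<K> keyComparator);
    }

    /** Hypothetical holder for one map output record. */
    static final class Record<K, V> {
        final K key;
        final V value;
        final int partition;

        Record(K key, V value, int partition) {
            this.key = key;
            this.value = value;
            this.partition = partition;
        }
    }

    /** Toy plugin: buffers records in memory and sorts by partition, then key. */
    static class InMemorySortProcessor<K, V> implements PostMapProcessor<K, V> {
        private final List<Record<K, V>> buffer = new ArrayList<>();
        private final List<Record<K, V>> output = new ArrayList<>();
        private Comparator<K> keyComparator;

        @Override
        public void initialize(Comparator<K> keyComparator) {
            this.keyComparator = keyComparator;
        }

        @Override
        public void collect(K key, V value, int partition) {
            buffer.add(new Record<>(key, value, partition));
        }

        @Override
        public void flush() {
            // Order by partition first so each reducer's data is contiguous,
            // then by key within a partition.
            Comparator<Record<K, V>> byPartition = Comparator.comparingInt(r -> r.partition);
            buffer.sort(byPartition.thenComparing(r -> r.key, keyComparator));
            output.addAll(buffer);
            buffer.clear();
        }

        @Override
        public void close() {
            buffer.clear();
        }

        /** Records emitted by flush(), in sorted order. */
        List<Record<K, V>> sortedOutput() {
            return output;
        }
    }

    public static void main(String[] args) {
        InMemorySortProcessor<String, Integer> processor = new InMemorySortProcessor<>();
        processor.initialize(Comparator.naturalOrder());
        processor.collect("b", 1, 1);
        processor.collect("a", 2, 1);
        processor.collect("c", 3, 0);
        processor.flush();
        for (Record<String, Integer> r : processor.sortedOutput()) {
            System.out.println(r.partition + ":" + r.key + "=" + r.value);
        }
        processor.close();
    }
}
```

In this sketch the framework would only ever see the {{PostMapProcessor}} type, so an external sorter could be substituted for the default without touching the map task's collect path.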
The revised patch can be easily integrated with the shuffle plugin.
I repeatedly ran the terasort benchmark on a cluster with 55 nodes. The
performance difference with and without the patch was negligible (plus or
minus 1%).
I would like to receive feedback from you and other developers who are watching
this Jira. In the meantime, I am writing a new test for the plugin.
Thanks.
-- Asokan
> Allow external sorter plugin for MR
> -----------------------------------
>
> Key: MAPREDUCE-2454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
> Reporter: Mariappan Asokan
> Assignee: Mariappan Asokan
> Priority: Minor
> Labels: features, performance, plugin, sort
> Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf,
> KeyValueIterator.java, MapOutputSorterAbstract.java, MapOutputSorter.java,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mr-2454-on-mr-279-build82.patch.gz,
> MR-2454-trunkPatchPreview.gz, ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to
> facilitate external sorter plugins both on the Map and Reduce sides.