[
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475913#comment-13475913
]
Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------
Hi Arun,
Thanks for your feedback. Although I have confidence in my contribution (I have
been running Terasort without any problems on 2 TB of data on a small
cluster), I understand your concerns about the size of the patch. I can think of
the following sub-steps, each addressed in a separate Jira:
* Refactor {{Task.java}} so that the classes {{ValuesIterator}},
{{CombinerRunner}}, {{CombineValuesIterator}}, and {{CombineOutputCollector}}
can be moved out into separate files.
* Refactor {{MapOutput.java}} to create {{InMemoryMapOutput}} and
{{OnDiskMapOutput}} classes.
* Refactor {{Shuffle.java}} and {{MergeManager.java}} to decouple shuffle and
merge. This should also allow one to make shuffle pluggable. There will be a
small change to {{ReduceTask.java}} as part of this decoupling since
{{ReduceTask}} will instantiate both {{Shuffle}} and {{MergeManager}} objects.
* Refactor {{MapTask.java}} so that the code related to sort on the map side is
moved to a new file {{MapSort.java}}. Introduce {{SortinRecordWriter}} and
{{SortoutRecordReader}} classes as part of this refactoring.
* Refactor {{ReduceTask.java}} so that merge related code is moved to a new
file {{ReduceSort.java}}.
* Define corresponding interfaces for {{MapSort}} and {{ReduceSort}} classes
and make these implementations pluggable.
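For illustration only, the last step could look roughly like the Java sketch
below. It is not taken from the attached patch: the interface name
{{MapSortPlugin}}, the configuration key, and the loader method are hypothetical,
and a plain {{Map}} stands in for the job configuration. The idea is simply that
the framework reads a class name from the configuration and instantiates the
sorter reflectively, falling back to a built-in default.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PluggableSortSketch {

    /** Hypothetical plugin contract for the map-side sort (not the patch's interface). */
    public interface MapSortPlugin {
        List<String> sort(List<String> keys);
    }

    /** Stand-in for the framework's built-in sort: natural ordering. */
    public static class DefaultMapSort implements MapSortPlugin {
        @Override
        public List<String> sort(List<String> keys) {
            List<String> copy = new ArrayList<>(keys);
            Collections.sort(copy);
            return copy;
        }
    }

    /** Stand-in for an external plugin: reverse ordering. */
    public static class ReverseMapSort implements MapSortPlugin {
        @Override
        public List<String> sort(List<String> keys) {
            List<String> copy = new ArrayList<>(keys);
            copy.sort(Collections.reverseOrder());
            return copy;
        }
    }

    /** Hypothetical configuration key; the real key name is an open design choice. */
    public static final String SORT_CLASS_KEY = "example.map.sort.class";

    /** Instantiates the configured sorter, or the default when no class is configured. */
    public static MapSortPlugin loadMapSort(Map<String, String> conf)
            throws ReflectiveOperationException {
        String className = conf.getOrDefault(SORT_CLASS_KEY, DefaultMapSort.class.getName());
        Class<? extends MapSortPlugin> cls =
                Class.forName(className).asSubclass(MapSortPlugin.class);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf = new HashMap<>();
        List<String> keys = List.of("b", "c", "a");

        // No plugin configured: the default sorter is used.
        System.out.println(loadMapSort(conf).sort(keys));

        // Plugin configured by class name: the external sorter is used.
        conf.put(SORT_CLASS_KEY, ReverseMapSort.class.getName());
        System.out.println(loadMapSort(conf).sort(keys));
    }
}
```

A {{ReduceSort}} counterpart would follow the same pattern with its own
interface and configuration key.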
How does the above sequence of changes sound to you? I can raise separate
Jiras for each one. We can keep these changes in a separate branch before
moving to the trunk if you wish.
If you have other suggestions, please let me know.
Thanks again.
-- Asokan
> Allow external sorter plugin for MR
> -----------------------------------
>
> Key: MAPREDUCE-2454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
> Reporter: Mariappan Asokan
> Assignee: Mariappan Asokan
> Priority: Minor
> Labels: features, performance, plugin, sort
> Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf,
> KeyValueIterator.java, MapOutputSorterAbstract.java, MapOutputSorter.java,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
> mr-2454-on-mr-279-build82.patch.gz, MR-2454-trunkPatchPreview.gz,
> ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to
> facilitate external sorter plugins both on the Map and Reduce sides.