[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475913#comment-13475913
 ] 

Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------

Hi Arun,
  Thanks for your feedback.  Though I have confidence in my contribution (I 
have been running Terasort without any problems for data sizes of 2 TB on a 
small cluster), I understand your concerns about the size of the patch.  I can 
think of the following sub-steps, each addressed in a separate Jira:

* Refactor {{Task.java}} so that the classes {{ValuesIterator}}, 
{{CombinerRunner}}, {{CombineValuesIterator}}, and {{CombineOutputCollector}} 
can be moved out to separate files.

* Refactor {{MapOutput.java}} to create {{InMemoryMapOutput}} and 
{{OnDiskMapOutput}} classes.

* Refactor {{Shuffle.java}} and {{MergeManager.java}} to decouple shuffle and 
merge.  This should also allow one to make shuffle pluggable.  There will be a 
small change to {{ReduceTask.java}} as part of this decoupling since 
{{ReduceTask}} will instantiate both {{Shuffle}} and {{MergeManager}} objects.

* Refactor {{MapTask.java}} so that the code related to sort on the map side is 
moved to a new file {{MapSort.java}}.  Introduce {{SortinRecordWriter}} and 
{{SortoutRecordReader}} classes as part of this refactoring.

* Refactor {{ReduceTask.java}} so that the merge-related code is moved to a new 
file {{ReduceSort.java}}.

* Define corresponding interfaces for {{MapSort}} and {{ReduceSort}} classes 
and make these implementations pluggable.
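
As a rough illustration of the last step only, a pluggable sorter could be 
selected by class name from the job configuration, along the lines of the 
sketch below.  Everything in it is an assumption made for illustration: the 
interface {{MapSortPlugin}}, the classes {{DefaultMapSort}} and 
{{ReverseMapSort}}, the loader, and the configuration key are hypothetical and 
are not taken from the patch, and a plain {{Map}} stands in for the Hadoop 
configuration object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical plugin interface; NOT the interface defined in the patch.
interface MapSortPlugin {
    // Returns a sorted copy of the given keys.
    List<String> sort(List<String> keys);
}

// Stand-in for the framework's built-in sort: ascending natural order.
class DefaultMapSort implements MapSortPlugin {
    @Override
    public List<String> sort(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }
}

// Stand-in for an external plugin: descending natural order.
class ReverseMapSort implements MapSortPlugin {
    @Override
    public List<String> sort(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Collections.reverseOrder());
        return copy;
    }
}

final class SortPluginLoader {
    // Hypothetical configuration key, invented for this sketch.
    static final String MAP_SORT_CLASS_KEY = "example.map.sort.class";

    private SortPluginLoader() {}

    // Instantiates the configured implementation by reflection, falling
    // back to DefaultMapSort when the key is absent.
    static MapSortPlugin load(Map<String, String> conf) {
        String className =
            conf.getOrDefault(MAP_SORT_CLASS_KEY, DefaultMapSort.class.getName());
        try {
            Class<? extends MapSortPlugin> cls =
                Class.forName(className).asSubclass(MapSortPlugin.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load sorter " + className, e);
        }
    }
}

class SortPluginDemo {
    public static void main(String[] args) {
        Map<String, String> conf =
            Map.of(SortPluginLoader.MAP_SORT_CLASS_KEY, ReverseMapSort.class.getName());
        MapSortPlugin sorter = SortPluginLoader.load(conf);
        System.out.println(sorter.sort(List.of("b", "a", "c")));
    }
}
```

The calling code depends only on the interface, so swapping the sorter is a 
configuration change rather than a code change.  A {{ReduceSort}} counterpart 
could presumably be loaded the same way.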

How does the above sequence of changes sound to you?  I can raise separate 
Jiras for each one.  We can keep these changes in a separate branch before 
moving to the trunk if you wish.

If you have other suggestions, please let me know.

Thanks again.

-- Asokan

                
> Allow external sorter plugin for MR
> -----------------------------------
>
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
>            Reporter: Mariappan Asokan
>            Assignee: Mariappan Asokan
>            Priority: Minor
>              Labels: features, performance, plugin, sort
>         Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf, 
> KeyValueIterator.java, MapOutputSorterAbstract.java, MapOutputSorter.java, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, 
> mr-2454-on-mr-279-build82.patch.gz, MR-2454-trunkPatchPreview.gz, 
> ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to 
> facilitate external sorter plugins both on the Map and Reduce sides.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
