[
https://issues.apache.org/jira/browse/HADOOP-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564724#action_12564724
]
Amar Kamat commented on HADOOP-2568:
------------------------------------
Should we try doing this in steps, as in:
1) First, have each reducer fetch all of its map outputs from a given node in one shot.
2) Then extract the shuffler from each reducer and have a common shuffler for
all the reducers on that node. This means having 3 task types {{mapper, shuffler,
reducer}}, no?
Having a separate shuffler will be a big change in terms of code and design, while
combining/piggybacking the map output fetches for one reducer will be comparatively
smaller.
Thoughts?
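As a rough illustration of the seek savings the description below claims, here is a back-of-the-envelope sketch (class and method names are hypothetical, not Hadoop APIs): if each map output file stores its partitions contiguously, a per-node shuffle task that reads the consecutive partitions of its reduces in one sequential pass pays one seek per (map, node) pair, versus one seek per (map, reduce) pair when every reduce fetches separately.

```java
// Hypothetical estimate of map-side disk seeks; illustrative only,
// not based on any actual Hadoop class.
public class SeekEstimate {
    /** Each reduce fetches its own partition: one seek per (map, reduce) pair. */
    static long seeksSeparate(long maps, long reducesPerNode, long nodes) {
        return maps * reducesPerNode * nodes;
    }

    /** One shuffle task per node reads its reduces' consecutive partitions
     *  in a single sequential pass: one seek per (map, node) pair. */
    static long seeksCombined(long maps, long nodes) {
        return maps * nodes;
    }

    public static void main(String[] args) {
        long maps = 1000, reducesPerNode = 2, nodes = 50;
        long separate = seeksSeparate(maps, reducesPerNode, nodes);
        long combined = seeksCombined(maps, nodes);
        // With 2 reduces per node this matches the 50% figure quoted below.
        System.out.printf("separate=%d combined=%d saved=%.0f%%%n",
                separate, combined,
                100.0 * (separate - combined) / separate);
    }
}
```

With 2 reduces per node the saving is 50%, and more generally (k-1)/k for k consecutive reduces per node, assuming partitions for consecutive reduce IDs are laid out adjacently in the map output file.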
> Pin reduces with consecutive IDs to nodes and have a single shuffle task per
> job per node
> -----------------------------------------------------------------------------------------
>
> Key: HADOOP-2568
> URL: https://issues.apache.org/jira/browse/HADOOP-2568
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Devaraj Das
> Fix For: 0.17.0
>
>
> The idea is to reduce disk seeks while fetching the map outputs. If we
> opportunistically pin reduces with consecutive IDs (like 5, 6, 7 ..
> max-reduce-tasks on that node) on a node, and have a single shuffle task, we
> should benefit, if for every fetch, that shuffle task fetches all the outputs
> for the reduces it is shuffling for. In the case where we have 2 reduces per
> node, we will decrease the #seeks in the map output files on the map nodes by
> 50%. Memory usage by that shuffle task would be proportional to the number of
> reduces it is shuffling for (to account for the number of ramfs instances,
> one per reduce). But overall it should help.
> Thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.