[ https://issues.apache.org/jira/browse/HADOOP-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564930#action_12564930 ]
Amar Kamat commented on HADOOP-2568:
------------------------------------
Yes, it makes sense. I was wondering how frequent this scenario (reducers in
sequence) will be. How does the average-case performance gain weigh against
the added complexity? Is there any particular case where this will be
extremely useful? Alternatively, can't we have the order in which map outputs
are stored in the final file determined by the order in which the reducers are
present on the host, with the final spill file then indexed by host name?
> Pin reduces with consecutive IDs to nodes and have a single shuffle task per
> job per node
> -----------------------------------------------------------------------------------------
>
> Key: HADOOP-2568
> URL: https://issues.apache.org/jira/browse/HADOOP-2568
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Devaraj Das
> Fix For: 0.17.0
>
>
> The idea is to reduce disk seeks while fetching the map outputs. If we
> opportunistically pin reduces with consecutive IDs (like 5, 6, 7 ..
> max-reduce-tasks on that node) on a node and have a single shuffle task, we
> should benefit, provided that on every fetch the shuffle task fetches all the
> outputs for the reduces it is shuffling for. In the case where we have 2
> reduces per node, we will decrease the number of seeks in the map output
> files on the map nodes by 50%. Memory usage by that shuffle task would be
> proportional to the number of reduces it is shuffling for (to account for the
> number of ramfs instances, one per reduce). But overall it should help.
> Thoughts?
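
To make the seek arithmetic in the description above concrete, here is a rough sketch in Java. It is not Hadoop code: the class ShuffleSeekSketch and its methods are hypothetical names, and it assumes a simplified model in which each map output file stores its partitions contiguously in reduce-ID order, each fetch request costs one seek, and a single shuffle task can fetch one contiguous run of partitions per request.

```java
import java.util.Arrays;

// Illustrative model only; none of these names exist in Hadoop.
public class ShuffleSeekSketch {

    // Counts the contiguous runs of reduce IDs on one node.
    // Assumes the IDs are distinct. Example: {5, 6, 7} is one run, {5, 7} is two.
    static int contiguousRuns(int[] reduceIds) {
        if (reduceIds.length == 0) {
            return 0;
        }
        int[] sorted = reduceIds.clone();
        Arrays.sort(sorted);
        int runs = 1;
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] != sorted[i - 1] + 1) {
                runs++;
            }
        }
        return runs;
    }

    // Estimated seeks in ONE map output file, given the reduce IDs placed on
    // each node. Without a shared shuffle task every reduce fetches its own
    // partition (one seek each); with one, each contiguous run is one seek.
    static int seeksPerMapOutput(int[][] reducesPerNode, boolean singleShuffleTask) {
        int seeks = 0;
        for (int[] node : reducesPerNode) {
            seeks += singleShuffleTask ? contiguousRuns(node) : node.length;
        }
        return seeks;
    }

    public static void main(String[] args) {
        // Three nodes, two reduces per node, consecutive IDs pinned together.
        int[][] pinned = {{0, 1}, {2, 3}, {4, 5}};
        System.out.println("per-reduce fetches:  " + seeksPerMapOutput(pinned, false));
        System.out.println("single shuffle task: " + seeksPerMapOutput(pinned, true));
    }
}
```

Under this model, two reduces per node with consecutive IDs gives 3 seeks instead of 6 per map output file, which matches the 50% figure in the description. If the IDs on a node are not consecutive (say {0, 3}), the single shuffle task gains nothing, which is one way to look at the "how frequent is this scenario" question.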
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.