[ 
https://issues.apache.org/jira/browse/HADOOP-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564730#action_12564730
 ] 

devaraj edited comment on HADOOP-2568 at 2/1/08 4:06 AM:
-------------------------------------------------------------

Given the current scheme of things, pulling all the map outputs from any given 
node might help: what we save is the latency of establishing connections to 
the tasktracker hosting the map outputs. We don't save on the random file seeks 
on the map output files. The proposal here, however, is that a single process 
on a node running several reducers copies the consecutive map outputs for all 
of them at once. For example, if a node is running reducers 5, 6, 7 and 8, we 
fetch 4 consecutive outputs from the tasktracker that ran a given map in one 
file seek, since within each map output file the outputs are laid out 
sequentially in order of reducer ID. Thus we cut the number of seeks by a 
factor of 4 in the case where we run 4 reducers per node. What you proposed can 
be done over and above what this jira proposes.
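
To make the seek argument concrete, here is a minimal sketch in Python rather 
than the actual mapred code. It assumes only what is stated above: a map output 
file holds one partition per reducer, laid out sequentially in reducer-ID 
order. The names partition_offsets and contiguous_reads are hypothetical 
helpers invented for illustration.

```python
from typing import List, Tuple


def partition_offsets(sizes: List[int]) -> List[int]:
    """Start offset of each reducer's partition in one map output file.

    Assumes partitions are stored back to back in reducer-ID order.
    """
    offsets, pos = [], 0
    for size in sizes:
        offsets.append(pos)
        pos += size
    return offsets


def contiguous_reads(sizes: List[int], reducer_ids: List[int]) -> List[Tuple[int, int]]:
    """Return the (offset, length) reads needed to fetch the given partitions.

    A run of consecutive reducer IDs collapses into a single read, i.e. a
    single seek on the map output file.
    """
    offsets = partition_offsets(sizes)
    reads: List[Tuple[int, int]] = []
    prev = None
    for rid in sorted(reducer_ids):
        if prev is not None and rid == prev + 1:
            # Adjacent to the previous partition: extend the current read.
            off, length = reads[-1]
            reads[-1] = (off, length + sizes[rid])
        else:
            # Gap in the IDs: a new seek is required.
            reads.append((offsets[rid], sizes[rid]))
        prev = rid
    return reads
```

With ten partitions of 10 bytes each, contiguous_reads([10] * 10, [5, 6, 7, 8]) 
yields one read, (50, 40), whereas non-consecutive IDs such as [5, 7] need two. 
That is why the reducers pinned to a node should have consecutive IDs.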

      was (Author: devaraj):
    Given the current scheme of things, pulling all the map outputs from any 
given node might help - what we save is the latency in establishing connections 
to the tasktracker hosting the map outputs. We don't save on the random file 
seeks on the map output files. However, the proposal here is that a single 
process copies a number of consecutive map outputs from a node running the 
corresponding reducers (e.g. if a node is running reducers 5,6,7,8, then we 
fetch 4 consecutive outputs from the tasktracker that ran the maps in one file 
seek since the map output files are organized in a way that within the files 
outputs are organized by the reducer IDs). Thus we cut the number of seeks by a 
factor of 4 in the case where we run 4 reducers per node. What you proposed can 
be done over and above what this jira proposes.
  
> Pin reduces with consecutive IDs to nodes and have a single shuffle task per 
> job per node
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2568
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.17.0
>
>
> The idea is to reduce disk seeks while fetching the map outputs. If we 
> opportunistically pin reduces with consecutive IDs (like 5, 6, 7 .. 
> max-reduce-tasks on that node) on a node, and have a single shuffle task, we 
> should benefit, if for every fetch, that shuffle task fetches all the outputs 
> for the reduces it is shuffling for. In the case where we have 2 reduces per 
> node, we will decrease the #seeks in the map output files on the map nodes by 
> 50%. Memory usage by that shuffle task would be proportional to the number of 
> reduces it is shuffling for (to account for the number of ramfs instances, 
> one per reduce). But overall it should help. 
> Thoughts?
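
The seek arithmetic in the description can be sketched as follows. This is a 
back-of-the-envelope model, not a measurement: it assumes every fetch costs 
exactly one seek on the map side and that each shuffle task serves a group of 
reduces with consecutive IDs. The function name map_side_seeks and its 
parameters are made up for illustration.

```python
import math


def map_side_seeks(num_maps: int, num_reduces: int, reduces_per_shuffle: int = 1) -> int:
    """Total seeks on map output files under the one-seek-per-fetch assumption.

    Each shuffle task fetches the partitions of all its reduces in one read,
    so every map output file is sought once per group of reduces.
    """
    groups = math.ceil(num_reduces / reduces_per_shuffle)
    return num_maps * groups
```

For example, with 100 maps and 8 reduces the model gives 800 seeks when every 
reduce fetches on its own, 400 with 2 reduces per shuffle task (the 50% 
quoted above), and 200 with 4.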

