[ 
https://issues.apache.org/jira/browse/HADOOP-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-1043:
--------------------------------

    Attachment: 1043.patch

This patch looks at all the available CopyResult objects from the copyResults 
list before querying the JobTracker for new map output locations.

> Optimize the shuffle phase (increase the parallelism)
> -----------------------------------------------------
>
>                 Key: HADOOP-1043
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1043
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>         Attachments: 1043.patch
>
>
> In the current shuffle code, only one map output location node is accessed 
> from any Reduce at any given point of time. For example, if a particular 
> node, say machine1.foo.com ran 300 maps, the reducer would fetch just one 
> output from there at a time. machine1.foo.com will be inserted into a Set 
> datastructure (uniqueHosts) and until it gets removed from there, no other 
> map output will be fetched from that machine. The fact that only one map 
> output is fetched at a time from any particular host seems fine, but the 
> logic for removing a node from uniqueHosts is such that there could be a lot 
> of delay before a node gets deleted from the Set datastructure (even after 
> the map output has been fetched from that node). This probably leads to 
> suboptimal performance since it reduces the parallelism in fetching.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to