[ https://issues.apache.org/jira/browse/HADOOP-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-1043:
--------------------------------

    Attachment: 1043.patch

This patch looks at all the available CopyResult objects from the copyResults list before querying the JobTracker for new map output locations.

> Optimize the shuffle phase (increase the parallelism)
> -----------------------------------------------------
>
>          Key: HADOOP-1043
>          URL: https://issues.apache.org/jira/browse/HADOOP-1043
>      Project: Hadoop
>   Issue Type: Improvement
>   Components: mapred
>     Reporter: Devaraj Das
>  Assigned To: Devaraj Das
>  Attachments: 1043.patch
>
>
> In the current shuffle code, only one map output location node is accessed
> from any Reduce at any given point of time. For example, if a particular
> node, say machine1.foo.com ran 300 maps, the reducer would fetch just one
> output from there at a time. machine1.foo.com will be inserted into a Set
> datastructure (uniqueHosts) and until it gets removed from there, no other
> map output will be fetched from that machine. The fact that only one map
> output is fetched at a time from any particular host seems fine, but the
> logic for removing a node from uniqueHosts is such that there could be a lot
> of delay before a node gets deleted from the Set datastructure (even after
> the map output has been fetched from that node). This probably leads to
> suboptimal performance since it reduces the parallelism in fetching.
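
For readers who have not opened the patch, the idea can be sketched roughly as below. This is only an illustration written from the description in this message, not the code in 1043.patch. The names copyResults, uniqueHosts and CopyResult come from the issue text; the class ShuffleDrainSketch and the methods tryStartFetch, reportResult, drainFinishedCopies and isBusy are hypothetical, and the real Hadoop types are certainly more involved. The point it shows is that every finished copy is drained in one pass, so each host leaves uniqueHosts as soon as its fetch is done rather than one host per polling round.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch only; not the actual Hadoop shuffle code.
class ShuffleDrainSketch {

    // Hypothetical stand-in for a finished copy; not Hadoop's CopyResult.
    static final class CopyResult {
        final String host;
        final boolean success;

        CopyResult(String host, boolean success) {
            this.host = host;
            this.success = success;
        }
    }

    // Results posted by copier threads, waiting to be examined.
    private final Queue<CopyResult> copyResults = new ArrayDeque<>();

    // Hosts that currently have a fetch in flight (at most one per host).
    private final Set<String> uniqueHosts = new HashSet<>();

    // Claims a host for fetching; returns false if it is already busy.
    synchronized boolean tryStartFetch(String host) {
        return uniqueHosts.add(host);
    }

    // Called when a fetch finishes, whether it succeeded or failed.
    synchronized void reportResult(CopyResult result) {
        copyResults.add(result);
    }

    // Drains ALL available results and frees their hosts. In this sketch
    // it would be called before asking for new map output locations.
    synchronized int drainFinishedCopies() {
        int drained = 0;
        CopyResult result;
        while ((result = copyResults.poll()) != null) {
            uniqueHosts.remove(result.host);
            drained++;
        }
        return drained;
    }

    // True while a fetch from this host is still considered in flight.
    synchronized boolean isBusy(String host) {
        return uniqueHosts.contains(host);
    }
}
```

In this sketch a reducer loop would call drainFinishedCopies() first and only then ask for more locations, so a host such as machine1.foo.com with many pending map outputs becomes eligible again right after each fetch completes.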