[ 
https://issues.apache.org/jira/browse/HADOOP-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494344
 ] 

Doug Cutting commented on HADOOP-1338:
--------------------------------------

Since this is a performance enhancement, any patch for it must demonstrate a 
significant performance improvement before it can be committed.  I suspect that 
simply caching a fixed number of connections will not provide one.  Rather, we 
might attempt, when contacting a node, to transfer all available output that 
resides on that node.  So, instead of randomly shuffling the map output 
locations, a reduce task might first group locations by host, then randomly 
shuffle the hosts, fetching a batch from each.  However, if the shuffle is 
already keeping up with the maps, even this may not improve things much, since 
each map node will tend to have only a single new output at a time.
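
To make the group-by-host idea concrete, here is a minimal sketch.  It is not 
taken from the Hadoop code base: the class HostGroupedShuffle, the type 
MapOutputLocation and the method batchesByHost are hypothetical names chosen 
only for illustration, and a location is assumed to be just a host name plus a 
map id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class HostGroupedShuffle {
    /** Hypothetical stand-in for a map output location (not a Hadoop class). */
    static final class MapOutputLocation {
        final String host;
        final int mapId;

        MapOutputLocation(String host, int mapId) {
            this.host = host;
            this.mapId = mapId;
        }
    }

    /**
     * Groups the known locations by host, then shuffles the order of the
     * hosts.  Each inner list is one batch that could be fetched from a
     * single host in one go.
     */
    static List<List<MapOutputLocation>> batchesByHost(
            List<MapOutputLocation> locations, Random rng) {
        // Collect every location under the host that serves it.
        Map<String, List<MapOutputLocation>> byHost = new LinkedHashMap<>();
        for (MapOutputLocation loc : locations) {
            byHost.computeIfAbsent(loc.host, h -> new ArrayList<>()).add(loc);
        }
        // Randomize the host order so reduces do not all hit the same node first.
        List<List<MapOutputLocation>> batches = new ArrayList<>(byHost.values());
        Collections.shuffle(batches, rng);
        return batches;
    }
}
```

With this shape, a host that currently holds only one new output yields a 
batch of size one, which is the case where grouping would not be expected to 
help.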

> Improve the shuffle phase by using the "connection: keep-alive" and doing 
> batch transfers of files
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1338
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1338
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>             Fix For: 0.14.0
>
>
> We should do transfers of map outputs at the granularity of 
> *total-bytes-transferred* rather than the current way of transferring a 
> single file and then closing the connection to the server. A single 
> TaskTracker might have several map output files for a given reduce, and 
> we should transfer several of them (up to a certain total size) over a single 
> connection to the TaskTracker. Using HTTP/1.1's keep-alive connections would 
> help, since the connection stays open for more than one file transfer. 
> We should limit each transfer to a certain size so that we don't hold up a 
> Jetty thread indefinitely (and cause timeouts for other clients).
> Overall, this should give us improved performance.
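
The size cap in the description above could be sketched as follows.  This is 
only an illustration under stated assumptions, not the proposed patch: the 
class ByteBudgetBatcher, the method nextBatch and the parameter maxBytes are 
hypothetical names, and each pending map output is represented by nothing more 
than its size in bytes.

```java
import java.util.ArrayList;
import java.util.List;

final class ByteBudgetBatcher {
    /**
     * Takes files from the front of pendingSizes until adding the next one
     * would push the total past maxBytes.  At least one file is always
     * taken, so a single output larger than the budget is still fetched,
     * just on its own.
     */
    static List<Long> nextBatch(List<Long> pendingSizes, long maxBytes) {
        List<Long> batch = new ArrayList<>();
        long total = 0;
        for (long size : pendingSizes) {
            // Stop once the budget would be exceeded, unless the batch is empty.
            if (!batch.isEmpty() && total + size > maxBytes) {
                break;
            }
            batch.add(size);
            total += size;
        }
        return batch;
    }
}
```

For example, with pending sizes of 40, 50 and 30 bytes and a budget of 100, 
the first batch holds the 40- and 50-byte files, and the 30-byte file waits 
for the next connection.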

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
