[ 
https://issues.apache.org/jira/browse/HADOOP-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649084#action_12649084
 ] 

Devaraj Das commented on HADOOP-1338:
-------------------------------------

Owen, I am not sure that HTTP's keep-alive will always be useful in our 
scenario, because we fetch from a random host each time and, I believe, 
keep-alive holds the connection open only for a limited time. Even if the 
keep-alive timeout were configurable and we could raise it, scalability is a 
concern: on big clusters with thousands of nodes, a server cannot keep too 
many client connections alive at the same time...
My gut feeling is that we should see some benefit from pulling multiple map 
outputs per HTTP request. If we could use pipelining, we would have to be 
careful not to pull too many outputs in one go, since that might starve other 
clients. The question is whether Java's HTTP client or Apache HttpClient 
supports pipelining. If not, we would need to build the protocol on top of 
regular HTTP requests (something along the lines Runping suggested). But it 
seemed to me that this interferes with the in-memory shuffle in subtle ways, 
and that part needs to be thought about.
But yes, overall this requires a good amount of benchmarking...
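
To make the "don't pull too many outputs in one go" point concrete, here is a 
rough sketch of how a reducer might pick the map outputs for one request under 
a byte budget. This is only an illustration: BatchPlanner, MapOutput and 
nextBatch are hypothetical names, not existing Hadoop code, and the sketch 
assumes the output sizes are known before the fetch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: plans one size-capped fetch.
class BatchPlanner {

    // Hypothetical descriptor of one map output held by a TaskTracker.
    static final class MapOutput {
        final String mapId;
        final long sizeBytes;

        MapOutput(String mapId, long sizeBytes) {
            this.mapId = mapId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Takes outputs in order until the next one would push the total past
    // maxBytes. The first output is always taken, so a single file larger
    // than the budget is still fetched (alone) instead of being stuck.
    static List<MapOutput> nextBatch(List<MapOutput> pending, long maxBytes) {
        List<MapOutput> batch = new ArrayList<>();
        long total = 0;
        for (MapOutput out : pending) {
            if (!batch.isEmpty() && total + out.sizeBytes > maxBytes) {
                break;
            }
            batch.add(out);
            total += out.sizeBytes;
        }
        return batch;
    }
}
```

The cap bounds how long one request can occupy a server thread. What the cap 
should be, and how it plays with the in-memory shuffle, is exactly what would 
need benchmarking.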

> Improve the shuffle phase by using the "connection: keep-alive" and doing 
> batch transfers of files
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1338
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1338
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>
> We should do transfers of map outputs at the granularity of  
> *total-bytes-transferred* rather than the current way of transferring a 
> single file and then closing the connection to the server. A single 
> TaskTracker might have a couple of map output files for a given reduce, and 
> we should transfer multiple of them (up to a certain total size) in a single 
> connection to the TaskTracker. Using HTTP-1.1's keep-alive connection would 
> help since it would keep the connection open for more than one file transfer. 
> We should limit the transfers to a certain size so that we don't hold up a 
> jetty thread indefinitely (and cause timeouts for other clients).
> Overall, this should give us improved performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
