[ 
https://issues.apache.org/jira/browse/HADOOP-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590282#action_12590282
 ] 

Devaraj Das commented on HADOOP-3275:
-------------------------------------

bq. That is basically limited by how many http fetches a thread can do per 
second. To further improve shuffling, we may need to consider to use keep alive 
http connection and fetch multiple segments per http connection.

Actually, the shuffle client uses the URLConnection class, which uses HTTP 
keep-alive connections behind the scenes. I believe it keeps a list of open 
connections to hosts for a certain amount of time (I am not sure whether there 
is an upper limit on the number of such connections). The shuffle client, 
however, doesn't take advantage of this. We randomize the map output locations 
that we have, so the next fetch may go to a host whose connection is not in 
that list of cached connections. In the meantime, other cached connections 
might time out. 
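
One way to make the fetch order friendlier to the connection cache is sketched 
below. This is only an illustration, not the actual shuffle code: the class 
name HostGroupedFetchOrder, its methods, and the example URLs are all 
hypothetical. It groups map output locations by host and randomizes the order 
of hosts rather than of individual outputs, so consecutive fetches tend to go 
to the same host and have a chance of reusing a cached keep-alive connection.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of Hadoop: orders map output URLs so that
// all outputs served by the same host are fetched back to back.
class HostGroupedFetchOrder {

    // Group URLs by "host:port", preserving first-seen order of hosts
    // and the original order of URLs within each host.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            URI u = URI.create(url);
            String key = u.getHost() + ":" + u.getPort();
            byHost.computeIfAbsent(key, k -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    // Randomize the order of hosts (to spread load), but keep each
    // host's URLs adjacent in the resulting fetch order.
    static List<String> fetchOrder(List<String> urls, Random rnd) {
        Map<String, List<String>> byHost = groupByHost(urls);
        List<String> hosts = new ArrayList<>(byHost.keySet());
        Collections.shuffle(hosts, rnd);
        List<String> ordered = new ArrayList<>();
        for (String host : hosts) {
            ordered.addAll(byHost.get(host));
        }
        return ordered;
    }
}
```

Whether a connection is actually reused still depends on the JVM's HTTP 
client and on each response being read completely before the next request to 
that host, so this ordering only improves the odds of reuse.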


> Reduce task does not handle map outputs fetching efficiently 
> -------------------------------------------------------------
>
>                 Key: HADOOP-3275
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3275
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>
> I ran a job that just counts the number of records in the input data (with a 
> combiner).
> The map phase took less than 10 minutes.
> But the shuffle took an additional 30 minutes!
> After examining the code and experimenting with a few tweaks, we discovered 
> that the probe_sample_size (50/100) in the task tracker is too small. 
> The fetchers just cannot be kept busy. After changing the probe size to 
> 5000, the shuffle time dropped to 13 minutes.
> With that setting, the fetching (30 threads) became the bottleneck. That is 
> basically limited by how many http fetches a thread can do per second.
> To further improve shuffling, we may need to consider to use keep alive http 
> connection and fetch multiple segments per http connection.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
