[ 
https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837512#action_12837512
 ] 

Konstantin Shvachko commented on HADOOP-1849:
---------------------------------------------

> Did you recently come across a case where adjusting this value improves 
> things? 

Yes. As we observed recently, the NN was viewed from the outside as 
unresponsive; that is, clients, the JT, and TTs could not connect to it. Looking at 
the name-node at the same time, it was working fine: no GC, average CPU usage. 
It turned out some clients were doing listStatus on large directories, and 
while processing the listStatus calls the NN RPC server maxed out its call queue, 
which did not let others connect. We wanted to make the queue large enough 
that (almost) everybody could connect and wait in the queue rather than 
retrying. The only way to increase the queue size currently is to increase the 
handler count, which was done. The idea here is to experiment with the queue 
size and the handler count independently of each other.
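The decoupling described above can be sketched as follows. This is a hypothetical illustration, not Hadoop's actual code: the idea is that the call-queue capacity is currently derived from the handler count (100 per handler), and an independent total-size knob would let the two be tuned separately. The method and parameter names here are assumptions for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of decoupling call-queue capacity from handler count.
// Not the real Hadoop IPC server code; names are illustrative only.
public class CallQueueSketch {
    // If an explicit total capacity is configured, use it; otherwise fall
    // back to the current coupled formula: handlers * per-handler length.
    static int queueCapacity(int handlerCount, int perHandlerLen, Integer totalOverride) {
        return (totalOverride != null) ? totalOverride : handlerCount * perHandlerLen;
    }

    public static void main(String[] args) {
        // Current behavior: 10 handlers * 100 slots each = 1000 queued calls.
        int coupled = queueCapacity(10, 100, null);
        // With an independent knob: grow the queue without adding handlers.
        int decoupled = queueCapacity(10, 100, 4000);
        BlockingQueue<Runnable> callQueue = new ArrayBlockingQueue<>(decoupled);
        System.out.println(coupled + " " + decoupled + " " + callQueue.remainingCapacity());
    }
}
```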
I agree on making it undocumented, for the two reasons I mentioned above. 
Suresh, Hairong, and Sanjay think it is important to have it documented. You might 
have better arguments.
Another thing: we keep thinking about the queue size in per-handler terms. 
Should we rather specify the total queue size and eliminate the pseudo-dependency 
on the handler count? The *.handler.count parameters are not in 
common, so it is hard to correlate handlers with queue sizes.

> IPC server max queue size should be configurable
> ------------------------------------------------
>
>                 Key: HADOOP-1849
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1849
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Raghu Angadi
>            Assignee: Konstantin Shvachko
>         Attachments: handlerQueueSizeConfig.patch, 
> handlerQueueSizeConfig.patch
>
>
> Currently the max queue size for the IPC server is set to (100 * handlers). Usually 
> when RPC failures are observed (e.g. HADOOP-1763), we increase the number of 
> handlers and the problem goes away. I think a big part of such a fix is the 
> increase in max queue size. I think we should make maxQsize per handler 
> configurable (with a bigger default than 100). There are other improvements 
> as well (HADOOP-1841).
> The Server keeps reading RPC requests from clients. When the number of in-flight 
> RPCs is larger than maxQsize, the earliest RPCs are deleted. This is the main 
> feedback the Server has for the client. I have often heard from users that Hadoop 
> doesn't handle bursty traffic.
> Say the handler count is 10 (the default) and the Server can handle 1000 RPCs a sec 
> (quite conservative/low for a typical server); this implies that an RPC can 
> wait only 1 sec before it is dropped. If there are 3000 clients and all 
> of them send RPCs around the same time (not very rare, with heartbeats etc.), 
> 2000 will be dropped. Instead of dropping the earliest RPCs, if the server 
> delayed reading new RPCs, the feedback to clients would be much smoother. I 
> will file another jira regarding queue management.
> For this jira I propose to make the queue size per handler configurable, with a 
> larger default (maybe 500).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
