[ 
https://issues.apache.org/jira/browse/HADOOP-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820244#comment-13820244
 ] 

Kihwal Lee commented on HADOOP-9956:
------------------------------------

bq. Closing idle connections might be the only option, if you don't want client 
to DoS server trivially, accidental or not, by opening too many idle 
connections. If an application protocol cares about idempotence, the 
application should handle it, i.e., we should fix job client to avoid 
submitting duplicate jobs. Otherwise many network issues will cause the same 
problem. We can even make it a little more client friendly by responding with an 
empty RPC frame with a busy code before closing the connection.

I fully agree. The limitation exists even today, and we need a way for the 
RPC server to better protect itself. As suggested above, it would be nice if 
the server could make clients cooperate by sending back something like EBUSY. 
If done right, this can spread out sharp peaks in load. Capping the number of 
allowed connections may also be necessary, but that is beyond the scope of 
this jira. [~daryn], would you file a jira for addressing this issue?
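
To make the two ideas above concrete, here is a rough sketch, not taken from 
the patch or from the Hadoop code base. {{BusySketch}}, {{ConnectionLimiter}} 
and {{backoffMillis}} are hypothetical names. It assumes the server rejects a 
connection with a busy reply once a fixed cap is reached, and that the client 
retries after a randomized exponential delay so retries do not arrive in one 
burst.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class BusySketch {
    /** Hypothetical server-side cap on the number of open connections. */
    public static final class ConnectionLimiter {
        private final int max;
        private final AtomicInteger open = new AtomicInteger();

        public ConnectionLimiter(int max) { this.max = max; }

        /** Returns true if the connection fits under the cap; false means
         *  the caller should send a busy reply and close the connection. */
        public boolean tryAcquire() {
            while (true) {
                int cur = open.get();
                if (cur >= max) return false;
                if (open.compareAndSet(cur, cur + 1)) return true;
            }
        }

        /** Call once when an accepted connection is closed. */
        public void release() { open.decrementAndGet(); }
    }

    /** Hypothetical client-side delay before retrying after a busy reply:
     *  a uniformly random value between 0 and min(cap, base * 2^attempt). */
    public static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

The random delay is what would spread out a sharp peak: clients rejected at 
the same moment come back at different times instead of all at once.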

As for the patch, the change looks good to me. +1.

> RPC listener inefficiently assigns connections to readers
> ---------------------------------------------------------
>
>                 Key: HADOOP-9956
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9956
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: ipc
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>         Attachments: HADOOP-9956.branch-23.patch, HADOOP-9956.patch, 
> HADOOP-9956.patch
>
>
> The socket listener and readers use complex synchronization to update the 
> reader's NIO {{Selector}}.  Updating active selectors is not thread-safe so 
> precautions are required.
> However, the current locking choreography results in a serialized 
> distribution of new connections to the parallel socket readers.  A 
> slower/busier reader can stall the listener and throttle performance.
> The problem manifests as unexpectedly low CPU utilization by the listener and 
> readers (~20-30%) under heavy load.  The call queue is shallow when it should 
> be overflowing.
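
One common way to avoid this kind of serialized handoff is sketched below. It 
is an illustration under assumptions, not necessarily what the attached 
patches do, and {{ReaderSketch}} and its methods are hypothetical names. Each 
reader owns its {{Selector}} and a queue of pending channels. The listener 
only enqueues a channel and calls {{wakeup()}}, so it never waits on a busy 
reader; the reader registers the queued channels itself between selects.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReaderSketch {
    /** Hypothetical reader that is the only thread registering channels
     *  with its own Selector. */
    public static final class Reader {
        private final Selector selector;
        private final BlockingQueue<SelectableChannel> pending =
            new LinkedBlockingQueue<>();

        public Reader() throws IOException { selector = Selector.open(); }

        /** Called by the listener thread: enqueue the channel and wake the
         *  selector. Does not register the channel or wait for the reader. */
        public void addConnection(SelectableChannel ch) {
            pending.add(ch);
            selector.wakeup();
        }

        /** Called by the reader thread between select() calls: drains the
         *  queue and registers each channel for reads. Returns the count. */
        public int registerPending() throws IOException {
            int n = 0;
            SelectableChannel ch;
            while ((ch = pending.poll()) != null) {
                ch.configureBlocking(false);
                ch.register(selector, SelectionKey.OP_READ);
                n++;
            }
            return n;
        }

        public int registeredKeys() { return selector.keys().size(); }

        public void close() throws IOException { selector.close(); }
    }
}
```

In this arrangement the listener's cost per connection is a queue insert and 
a wakeup, regardless of how busy the chosen reader is.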



--
This message was sent by Atlassian JIRA
(v6.1#6144)
