[
https://issues.apache.org/jira/browse/HADOOP-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820244#comment-13820244
]
Kihwal Lee commented on HADOOP-9956:
------------------------------------
bq. Closing idle connections might be the only option if you don't want a client
to trivially DoS the server, accidentally or not, by opening too many idle
connections. If an application protocol cares about idempotence, the
application should handle it, i.e., we should fix the job client to avoid
submitting duplicate jobs. Otherwise many network issues will cause the same
problem. We can even make it a little more client-friendly by responding with an
empty RPC frame with a busy code before closing the connection.

I fully agree. The limitation exists even today, and we need a way for the
RPC server to better protect itself. As suggested above, it would be nice if the
server could make clients cooperate by sending back something like EBUSY. If done
right, this can spread out sharp load peaks. Capping the number of allowed
connections may also be necessary, but that is beyond the scope of this jira.
[~daryn], would you file a jira for addressing this issue?

As for the patch, the change looks good to me. +1.
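
To make the two ideas in this comment concrete (capping connections and answering with a busy code so clients back off), here is a minimal sketch. Everything in it is hypothetical: {{ConnectionGate}}, {{Status.BUSY}} and {{BusyBackoff}} are illustrative names, not existing Hadoop classes, and the jittered exponential backoff is just one way a cooperating client could spread out a connection peak.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical server-side admission check; not part of Hadoop's ipc.Server. */
final class ConnectionGate {
    /** Hypothetical status codes; the comment only suggests "something like EBUSY". */
    enum Status { ACCEPTED, BUSY }

    private final int maxConnections;
    private final AtomicInteger open = new AtomicInteger();

    ConnectionGate(int maxConnections) {
        if (maxConnections <= 0) {
            throw new IllegalArgumentException("maxConnections must be positive");
        }
        this.maxConnections = maxConnections;
    }

    /** Reserves a slot for a new connection, or reports BUSY when the cap is reached. */
    Status tryAcquire() {
        while (true) {
            int current = open.get();
            if (current >= maxConnections) {
                return Status.BUSY;
            }
            if (open.compareAndSet(current, current + 1)) {
                return Status.ACCEPTED;
            }
        }
    }

    /** Frees a slot when a connection closes; the count never drops below zero. */
    void release() {
        open.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    int openCount() {
        return open.get();
    }
}

/** Hypothetical client-side reaction to a BUSY reply: exponential backoff with jitter. */
final class BusyBackoff {
    /** Returns a random delay in [0, min(capMillis, baseMillis * 2^attempt)]. */
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random rng) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        long ceiling = Math.min(capMillis, baseMillis << shift);
        return (long) (rng.nextDouble() * (ceiling + 1));
    }
}
```

In this sketch the listener would call {{tryAcquire()}} right after accepting a socket and, on {{BUSY}}, send the busy reply and close the connection. Because each client picks a random delay, retries from many rejected clients do not all arrive at once.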
> RPC listener inefficiently assigns connections to readers
> ---------------------------------------------------------
>
> Key: HADOOP-9956
> URL: https://issues.apache.org/jira/browse/HADOOP-9956
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: ipc
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Attachments: HADOOP-9956.branch-23.patch, HADOOP-9956.patch,
> HADOOP-9956.patch
>
>
> The socket listener and readers use complex synchronization to update the
> reader's NIO {{Selector}}. Updating active selectors is not thread-safe, so
> precautions are required.
> However, the current locking choreography results in serialized
> distribution of new connections to the parallel socket readers. A
> slower/busier reader can stall the listener and throttle performance.
> The problem manifests as unexpectedly low CPU utilization by the listener and
> readers (~20-30%) under heavy load. The call queue is shallow when it should
> be overflowing.
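
One common way to keep a listener from blocking on a busy reader is to hand each new connection over through a per-reader queue and let the reader thread alone touch its {{Selector}}. The sketch below illustrates that general technique only. It is not taken from the attached patches, and {{QueueingReader}} and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical reader that owns its Selector and accepts handoffs through a queue. */
final class QueueingReader implements AutoCloseable {
    private final Selector selector;
    private final BlockingQueue<SocketChannel> pending = new LinkedBlockingQueue<>();

    QueueingReader() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called by the listener thread: enqueue and wake the reader without waiting on it. */
    void addConnection(SocketChannel channel) {
        pending.add(channel);
        selector.wakeup(); // makes a blocked or upcoming select() return promptly
    }

    /** Called only by the reader thread: registers all queued channels, returns the count. */
    int registerPending() {
        int registered = 0;
        SocketChannel channel;
        try {
            while ((channel = pending.poll()) != null) {
                channel.configureBlocking(false);
                channel.register(selector, SelectionKey.OP_READ);
                registered++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return registered;
    }

    int pendingCount() {
        return pending.size();
    }

    int registeredCount() {
        return selector.keys().size();
    }

    @Override
    public void close() {
        try {
            selector.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader loop built on this would alternate between {{registerPending()}} and {{select()}}. The listener's cost per connection is then a queue insert plus a {{wakeup()}}, independent of how busy the reader is.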