[
https://issues.apache.org/jira/browse/HADOOP-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766692#comment-13766692
]
Daryn Sharp commented on HADOOP-9956:
-------------------------------------
New connections already have intrinsically higher priority. All connection
channels are in the reader's selector, which only returns channels that are
ready for reading, i.e. not idle.
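
To illustrate the selector behavior described above, here is a minimal sketch using plain java.nio rather than the actual ipc Server code. The class name, the "conn-N" labels, and the use of in-process pipes as stand-ins for client sockets are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
public class ReadySelectorSketch {

    /**
     * Registers `count` idle channels with one selector, makes only the
     * channel at `activeIndex` readable, and returns the labels of the
     * channels the selector reports as ready.
     */
    public static List<String> readyChannels(int count, int activeIndex) throws IOException {
        List<Pipe> pipes = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (int i = 0; i < count; i++) {
                Pipe pipe = Pipe.open();
                pipes.add(pipe);
                pipe.source().configureBlocking(false);
                // The attachment is just a label so the result is readable.
                pipe.source().register(selector, SelectionKey.OP_READ, "conn-" + i);
            }

            // Only one "connection" has data pending; the others stay idle.
            pipes.get(activeIndex).sink().write(ByteBuffer.wrap(new byte[] {1}));

            List<String> ready = new ArrayList<>();
            if (selector.select(1000) > 0) {
                for (SelectionKey key : selector.selectedKeys()) {
                    ready.add((String) key.attachment());
                }
            }
            return ready;
        } finally {
            for (Pipe pipe : pipes) {
                pipe.source().close();
                pipe.sink().close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readyChannels(3, 1));
    }
}
```

The idle channels never show up in the selected-key set, so they do not compete with channels that have data pending.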
Prematurely closing sockets is a very bad idea. I've thought through various
approaches, and closing is the worst. Closing a socket in between calls is ok
because the client will reconnect on the next call. The problem is when a
client has a request in flight on the network, or the server has received it
but the reader hasn't serviced it yet. The client has no option but to throw
an exception because it doesn't know whether the call is idempotent and
therefore safe to retry.
A lot of effort has been spent to address idempotency issues for HA NNs, but
other RPC clients won't handle this case gracefully. Imagine sockets being
closed during job submissions to a heavily loaded RM that is aggressively
closing connections. A workflow manager like Oozie will resubmit, creating
duplicate jobs.
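
The decision a client faces when its connection is closed can be sketched roughly as follows. This is not the Hadoop ipc Client; the class, enum, and parameter names are hypothetical, and the `idempotent` flag assumes the caller has some way to know that property, which most RPC clients do not.

```java
// Hypothetical illustration, not Hadoop code.
public class RetryDecisionSketch {

    public enum Action { RETRY, FAIL }

    /**
     * @param requestSent true if a request was already written to the socket
     *                    when the close was observed
     * @param idempotent  true only if the call is known to be safe to repeat
     */
    public static Action onConnectionClosed(boolean requestSent, boolean idempotent) {
        if (!requestSent) {
            // Closed between calls: reconnecting on the next call is harmless.
            return Action.RETRY;
        }
        // The request may or may not have been executed by the server.
        // Repeating it is only safe if the call is known to be idempotent.
        return idempotent ? Action.RETRY : Action.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(onConnectionClosed(false, false));
        System.out.println(onConnectionClosed(true, false));
        System.out.println(onConnectionClosed(true, true));
    }
}
```

A client that retries in the in-flight, non-idempotent case instead of failing is what produces the duplicate job submissions.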
> RPC listener inefficiently assigns connections to readers
> ---------------------------------------------------------
>
> Key: HADOOP-9956
> URL: https://issues.apache.org/jira/browse/HADOOP-9956
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: ipc
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Attachments: HADOOP-9956.patch
>
>
> The socket listener and readers use complex synchronization to update the
> reader's NIO {{Selector}}. Updating active selectors is not thread-safe, so
> precautions are required.
> However, the current locking choreography results in a serialized
> distribution of new connections to the parallel socket readers. A
> slower/busier reader can stall the listener and throttle performance.
> The problem manifests as unexpectedly low CPU utilization by the listener and
> readers (~20-30%) under heavy load. The call queue is shallow when it should
> be overflowing.
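
One common way to keep a listener from stalling on a busy reader is to hand new connections over through a queue and let the reader register them on its own thread. The sketch below shows that general technique only; it is not taken from the attached patch, and every class and method name in it is hypothetical. Pipes stand in for accepted sockets.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Hadoop code.
public class ReaderHandoffSketch {

    static class Reader implements Runnable {
        private final Selector selector = Selector.open();
        private final BlockingQueue<SelectableChannel> pending = new LinkedBlockingQueue<>();
        private final CountDownLatch registered;
        private volatile boolean running = true;

        Reader(int expectedConnections) throws IOException {
            registered = new CountDownLatch(expectedConnections);
        }

        /** Called from the listener thread; never waits on the select loop. */
        void addConnection(SelectableChannel channel) {
            pending.add(channel);
            selector.wakeup(); // makes the current or next select() return
        }

        void shutdown() {
            running = false;
            selector.wakeup();
        }

        @Override
        public void run() {
            try {
                while (running) {
                    selector.select();
                    // Only the reader thread touches its selector's key set.
                    SelectableChannel channel;
                    while ((channel = pending.poll()) != null) {
                        channel.configureBlocking(false);
                        channel.register(selector, SelectionKey.OP_READ);
                        registered.countDown();
                    }
                    // A real reader would service the ready channels here.
                    selector.selectedKeys().clear();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                try { selector.close(); } catch (IOException ignored) { }
            }
        }
    }

    /** Hands `connections` channels to one reader; returns how many it registered. */
    public static int demo(int connections) throws Exception {
        Reader reader = new Reader(connections);
        Thread thread = new Thread(reader, "reader-0");
        thread.start();
        List<Pipe> pipes = new ArrayList<>();
        for (int i = 0; i < connections; i++) {
            Pipe pipe = Pipe.open();
            pipes.add(pipe);
            reader.addConnection(pipe.source());
        }
        reader.registered.await(5, TimeUnit.SECONDS);
        int count = connections - (int) reader.registered.getCount();
        reader.shutdown();
        thread.join();
        for (Pipe pipe : pipes) {
            pipe.source().close();
            pipe.sink().close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo(3));
    }
}
```

Because the listener only enqueues and calls wakeup(), the time it spends per connection does not depend on how busy the target reader is.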