[ 
https://issues.apache.org/jira/browse/HADOOP-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766692#comment-13766692
 ] 

Daryn Sharp commented on HADOOP-9956:
-------------------------------------

New connections already have intrinsically higher priority.  All connection 
channels are registered with the reader's selector, which only returns 
channels that are ready for reading, i.e. not idle.
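
To illustrate the selector behavior described above, here is a minimal 
standalone sketch.  It is not Hadoop's reader code: the class name, the 
method name and the "idle"/"active" labels are made up for the example, and 
it uses java.nio Pipe channels in place of real client sockets.  Two channels 
are registered for OP_READ, only one has data, and only that one shows up in 
the selected-key set.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Hadoop IPC server.
final class ReadySelectorSketch {

    /** Registers one idle and one active channel and returns the labels reported as readable. */
    static List<String> readyLabels() throws IOException {
        try (Selector selector = Selector.open()) {
            Pipe idle = Pipe.open();    // stands in for a connection with no pending data
            Pipe active = Pipe.open();  // stands in for a connection with a request waiting
            try {
                // Channels must be non-blocking before they can be registered.
                idle.source().configureBlocking(false);
                active.source().configureBlocking(false);
                idle.source().register(selector, SelectionKey.OP_READ, "idle");
                active.source().register(selector, SelectionKey.OP_READ, "active");

                // Only the "active" channel receives data.
                active.sink().write(ByteBuffer.wrap(new byte[] {1}));

                // Wait up to one second for at least one channel to become ready.
                selector.select(1000);

                List<String> ready = new ArrayList<>();
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        ready.add((String) key.attachment());
                    }
                }
                return ready;
            } finally {
                idle.sink().close();
                idle.source().close();
                active.sink().close();
                active.source().close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The idle channel is registered but is not expected in the result.
        System.out.println(readyLabels());
    }
}
```

The idle channel costs the reader nothing on each pass through select(); it 
is simply never returned until data arrives.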

Prematurely closing sockets is a very bad idea.  I've thought through various 
approaches, and closing is the worst.  Closing a socket between calls is ok 
because the client will reconnect on the next call.  The problem arises when 
a client has a request in flight on the network, or the server has received 
it but the reader hasn't serviced it yet.  The client then has no option but 
to throw an exception, because it doesn't know if the call is idempotent.
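
The client's dilemma can be sketched as a small decision function.  This is 
an assumption-laden illustration, not the Hadoop IPC client: the class, the 
enum and both parameter names are hypothetical, and the "knownIdempotent" 
flag stands for whatever out-of-band knowledge a caller might have about the 
call.

```java
// Hypothetical illustration only; not the Hadoop IPC client logic.
final class ClosedConnectionSketch {

    enum Action { RECONNECT_AND_SEND, RETRY, FAIL }

    /**
     * Decides what a client can safely do after the server closes its socket.
     *
     * @param callInFlight    true if a request was sent and no response was received
     * @param knownIdempotent true only if the caller knows the call can be repeated safely
     */
    static Action onServerClose(boolean callInFlight, boolean knownIdempotent) {
        if (!callInFlight) {
            // Closed between calls: nothing was lost, so reconnect on the next call.
            return Action.RECONNECT_AND_SEND;
        }
        if (knownIdempotent) {
            // The server may or may not have executed the call; repeating it is harmless.
            return Action.RETRY;
        }
        // The call may already have run on the server; retrying could duplicate it.
        return Action.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(onServerClose(false, false));
        System.out.println(onServerClose(true, true));
        System.out.println(onServerClose(true, false));
    }
}
```

A generic RPC client usually has no such flag available, so every close with 
a call in flight ends up in the last branch.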

A lot of effort has been spent addressing idempotency issues for HA NNs, but 
other RPC clients won't handle this case gracefully.  Imagine if sockets kept 
getting closed on job submissions to a heavily loaded RM that is aggressively 
closing connections.  A workflow manager like Oozie will resubmit duplicate 
jobs.
                
> RPC listener inefficiently assigns connections to readers
> ---------------------------------------------------------
>
>                 Key: HADOOP-9956
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9956
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: ipc
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>         Attachments: HADOOP-9956.patch
>
>
> The socket listener and readers use a complex synchronization to update the 
> reader's NIO {{Selector}}.  Updating active selectors is not thread-safe so 
> precautions are required.
> However, the current locking choreography results in a serialized 
> distribution of new connections to the parallel socket readers.  A 
> slower/busier reader can stall the listener and throttle performance.
> The problem manifests as unexpectedly low cpu utilization by the listener and 
> readers (~20-30%) under heavy load.  The call queue is shallow when it should 
> be overflowing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
