[ 
https://issues.apache.org/jira/browse/HADOOP-9229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557465#comment-13557465
 ] 

Kihwal Lee commented on HADOOP-9229:
------------------------------------

[~tlipcon] In this scenario, setupIOStreams() will throw an exception without
retrying, because handleSaslConnectionFailure() gives up. If the auth mode is
Kerberos, it will be retried, but that retry still happens outside of
setupConnection() and does not involve handleConnectionFailure(). Maybe we
should add a check for the connection retry policy in
handleSaslConnectionFailure().
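
To make the idea concrete, here is a rough, self-contained sketch of a
retry-policy check wrapped around a negotiation step. It is not the actual
Hadoop IPC client code: NegotiationRetrySketch, Negotiation, isTransient(),
negotiateWithRetry(), maxRetries and sleepMillis are all hypothetical names,
and treating every SocketException (which includes "Connection reset") and
SocketTimeoutException as transient is an assumption made for illustration.

```java
import java.io.IOException;
import java.net.SocketException;
import java.net.SocketTimeoutException;

// Hypothetical sketch; none of these names are taken from the Hadoop IPC client.
final class NegotiationRetrySketch {

    /** Stand-in for the negotiation step (for example, a SASL handshake). */
    interface Negotiation {
        void run() throws IOException;
    }

    /** Assumed classification: socket timeouts and socket errors such as resets are transient. */
    static boolean isTransient(IOException e) {
        return e instanceof SocketTimeoutException || e instanceof SocketException;
    }

    /**
     * Runs the negotiation, retrying transient failures up to maxRetries times.
     * Returns the number of attempts made when the negotiation succeeds.
     */
    static int negotiateWithRetry(Negotiation n, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                n.run();
                return attempts;
            } catch (IOException e) {
                // Give up on non-transient errors or once the retry budget is spent.
                if (!isTransient(e) || attempts > maxRetries) {
                    throw e;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

Retrying here is only safe because no call has been sent yet at this stage, so
a repeated attempt cannot execute a request twice. A real change would need to
consult the connection's configured retry policy rather than a fixed count.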

[~sureshms] We've also seen this happening against the AM. Since there is a
finite number of tasks, retrying would have made the job succeed. This failure
mode is particularly bad since clients fail without retrying. For requests for
which only one chance is allowed, this is fatal. Since failed jobs get retried,
the same situation will likely repeat. If all requests are eventually served,
the load will go away without doing more damage.

I agree that if this condition is sustained, the cluster has a bigger problem
and no ipc-level actions will solve that. But for transient overloads, we want
the system to behave more gracefully. One concern is the server accepting too
many connections and running out of FDs, which causes all kinds of bad things.
This can be prevented by HADOOP-9137.
                
> IPC: Retry on connection reset or socket timeout during SASL negotiation
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-9229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9229
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.7
>            Reporter: Kihwal Lee
>
> When an RPC server is overloaded, incoming connections may not get accepted
> in time, causing listen queue overflow. The impact on the client varies
> depending on the OS in use. On Linux, connections in this state look fully
> connected to the clients, but they are without buffers, thus any data sent to
> the server will get dropped.
> This won't be a problem for protocols where the client first waits for the
> server's greeting. Even for client-speaks-first protocols, it will be fine if
> the overload is transient and such connections are accepted before the
> retransmissions of dropped packets arrive. Otherwise, clients can hit a socket
> timeout after several retransmissions. In certain situations, the connection
> will get reset while clients are still waiting for an ack.
> We have seen this happening to IPC clients during SASL negotiation. Since no
> call has been sent, we should allow a retry when a connection reset or socket
> timeout happens at this stage.
