[jira] [Commented] (HDFS-15885) RBF: Data loss when Router setup connection timeout

Xiaoqiao He (Jira) Sun, 14 Mar 2021 00:23:06 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301080#comment-17301080
 ]


Xiaoqiao He commented on HDFS-15885:
------------------------------------

{quote}This issue surfaced because the call from Router A reached the Namenode 
post RetryCache expiry time, right? If it would have reached before that we are 
safe post the caller context stuff?{quote}
Yes, that is true. Based on my observation, the root cause is that setting up 
the connection from the Router to the NameNode takes a long time when there is 
a network issue. My temporary solution is to lower the values of 
`maxRetriesOnSocketTimeouts`, `connectionTimeout`, `maxRetriesOnSasl`, and 
`rpcTimeout` so that no duplicate RPC request reaches the target NameNode 
after the RetryCache expiry time. 
{quote}Another stuff  we should ensure that if the client dropped off, the 
Router to Namenode connection should also abort.{quote}
My concern is that this connection is multiplexed across different Clients 
that share the same UGI and target NameNode, so could aborting it directly 
cause other issues? I have not thought about it deeply. It may be one option. 
Thanks.

> RBF: Data loss when Router setup connection timeout
> ---------------------------------------------------
>
>                 Key: HDFS-15885
>                 URL: https://issues.apache.org/jira/browse/HDFS-15885
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>            Reporter: Xiaoqiao He
>            Priority: Critical
>
> I recently hit a corner case that can lose data; it is very similar 
> to HDFS-15079.
> Consider the following case:
> A. The Client first sends a `create` RPC request to Router A. Router A tries 
> to set up a new connection to the NameNode for this request but does not 
> succeed in time.
> B. The Client fails over to Router B because the request times out (60s by 
> default, IIRC).
> C. Router B runs normally (handling the `create` and `complete` RPCs) and 
> returns to the Client.
> D. After a while (more than 10min), Router A is working again and sends 
> `create` to the NameNode, so the file is overwritten and its data is lost.
> Note that in my deployment we have replaced the ClientId and CallId of the 
> RPC with the Client's own ids at the Router side, rather than ids generated 
> by the Router.
> After digging deeper, we found that connection setup can take a very long 
> time when there are network issues. In the worst case, it takes 
> (60 * 3 + 45 * 20) * 5 = 5400 seconds (far greater than 10min, the 
> RetryCache expiry time) to set up connections, which is governed by 
> `maxRetriesOnSocketTimeouts`, `connectionTimeout`, `maxRetriesOnSasl` and 
> `rpcTimeout`. In this case, the delayed request is not covered by the 
> `RetryCache` (10min by default) at the NameNode side.
> IMO, we should offer basic configuration suggestions for the Router 
> (especially for the RPC layer) to avoid this data loss case.
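
The worst-case arithmetic in the description can be checked with a short sketch. The description does not say which of the four parameters supplies which factor, so the constant names below are neutral placeholders rather than Hadoop configuration keys, and `worst_case_setup_seconds` is a hypothetical helper.

```python
# Factors copied from the expression in the description: (60 * 3 + 45 * 20) * 5.
# Which parameter supplies which factor is not stated there, so these names
# are placeholders only.
FIRST_TERM_S = 60 * 3    # 180 s
SECOND_TERM_S = 45 * 20  # 900 s
OUTER_MULTIPLIER = 5

RETRY_CACHE_EXPIRY_S = 10 * 60  # 10 min, the default expiry cited in the description


def worst_case_setup_seconds():
    """Worst-case connection setup time, per the expression in the description."""
    return (FIRST_TERM_S + SECOND_TERM_S) * OUTER_MULTIPLIER


def exceeds_retry_cache(setup_s, expiry_s=RETRY_CACHE_EXPIRY_S):
    """True if a request delayed by setup_s could arrive after the cache entry expires."""
    return setup_s > expiry_s


if __name__ == "__main__":
    worst = worst_case_setup_seconds()
    print(f"worst case: {worst} s ({worst / 60:.0f} min)")
    print(f"exceeds RetryCache expiry: {exceeds_retry_cache(worst)}")
```

Under these numbers the worst case is 5400 seconds (90 minutes), nine times the 10-minute expiry, which is why lowering the retry and timeout values until the product falls below the expiry closes the window.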



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

