[ 
https://issues.apache.org/jira/browse/HDFS-15885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300972#comment-17300972
 ] 

Ayush Saxena commented on HDFS-15885:
-------------------------------------

Thanks [~hexiaoqiao] for the report.

If I understand correctly, this issue surfaced because the call from Router A 
reached the Namenode after the RetryCache expiry time, right? Had it reached 
the Namenode before that, would we be safe given the caller context changes?

If so, I think we should ensure that the Router-to-Namenode call times out 
before the RetryCache expiry time; I feel that timeout should be configurable. 
We should also ensure that if the client drops off, the Router-to-Namenode 
connection is aborted as well.
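
A minimal sketch of the first suggestion, assuming a plain Java executor in place of the real Router RPC client. `DeadlineCall`, `callWithDeadline` and `safetyMargin` are hypothetical names for illustration and are not part of Hadoop. The idea is to give the whole downstream call a single budget that is strictly shorter than the RetryCache expiry, and to cancel the call once that budget is spent.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineCall {

    /**
     * Runs a call with an overall budget of (retryCacheExpiry - safetyMargin).
     * Hypothetical helper: it stands in for a Router-to-Namenode invocation.
     */
    static <T> T callWithDeadline(Callable<T> call,
                                  Duration retryCacheExpiry,
                                  Duration safetyMargin) throws Exception {
        Duration budget = retryCacheExpiry.minus(safetyMargin);
        if (budget.isZero() || budget.isNegative()) {
            throw new IllegalArgumentException(
                "safety margin must be smaller than the RetryCache expiry");
        }
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> pending = pool.submit(call);
        try {
            // Wait no longer than the budget, however many retries happen inside.
            return pending.get(budget.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the in-flight call so it cannot reach the Namenode late.
            pending.cancel(true);
            throw e;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // 10 min expiry (the default mentioned in the report), 1 min margin.
        String result = callWithDeadline(() -> "created",
            Duration.ofMinutes(10), Duration.ofMinutes(1));
        System.out.println(result);
    }
}
```

The same cancellation path could also be triggered when the client disconnects, which would cover the second suggestion; how the Router detects that is outside this sketch.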

> RBF: Data loss when Router setup connection timeout
> ---------------------------------------------------
>
>                 Key: HDFS-15885
>                 URL: https://issues.apache.org/jira/browse/HDFS-15885
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>            Reporter: Xiaoqiao He
>            Priority: Critical
>
> I recently hit a corner case that can lose data; it is very similar to 
> HDFS-15079.
> Consider the following case:
> A. The client sends a `create` RPC request to Router A. Router A tries to 
> set up a new connection to the NameNode for this request, but the connection 
> is not established in time.
> B. The client fails over to Router B because the request times out (60s by 
> default, IIRC).
> C. Router B runs normally (including the `create` and `complete` RPCs) and 
> returns to the client.
> D. After a while (more than 10min), Router A is working again and sends 
> `create` to the NameNode again; the file is then overwritten and the data 
> is lost.
> Note that in my deployment we have replaced the ClientId and CallId of the 
> RPC with the client's IDs at the Router side, rather than IDs generated by 
> the Router.
> After digging deeper, we found that connection setup can take a very long 
> time when there are network issues. In the worst case, it takes 
> (60 * 3 + 45 * 20) * 5 = 5400 seconds, i.e. 90 minutes (far greater than the 
> 10min RetryCache expiry time), to set up connections; this depends on 
> `maxRetriesOnSocketTimeouts`, `connectionTimeout`, `maxRetriesOnSasl` and 
> `rpcTimeout`. In this case, the late request is not covered by the 
> `RetryCache` (10min by default) at the NameNode side.
> IMO, we should offer basic configuration suggestions for the Router 
> (especially for the RPC layer) to avoid this data loss case in the future.
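
The worst-case figure in the description can be checked with a few lines of arithmetic. This is only a sketch: the report does not say which factor belongs to which setting, so the factors are kept as plain numbers, and `WorstCaseSetup` and its method names are made up for illustration.

```java
public class WorstCaseSetup {

    // RetryCache expiry quoted in the report: 10 minutes.
    static final long RETRY_CACHE_EXPIRY_SECONDS = 10 * 60;

    /** Worst-case connection setup time from the report: (60 * 3 + 45 * 20) * 5. */
    static long worstCaseSeconds() {
        return (60L * 3 + 45L * 20) * 5;
    }

    /** True when a late retry could arrive after the RetryCache entry is gone. */
    static boolean exceedsRetryCache() {
        return worstCaseSeconds() > RETRY_CACHE_EXPIRY_SECONDS;
    }

    public static void main(String[] args) {
        long worst = worstCaseSeconds();
        System.out.println(worst + " s = " + (worst / 60) + " min"); // 5400 s = 90 min
        System.out.println("exceeds RetryCache expiry: " + exceedsRetryCache()); // true
    }
}
```

With these numbers the setup window is nine times the RetryCache expiry, which is why the second `create` from Router A is treated as a new request.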



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
