[
https://issues.apache.org/jira/browse/HDFS-15885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300972#comment-17300972
]
Ayush Saxena commented on HDFS-15885:
-------------------------------------
Thanks [~hexiaoqiao] for the report.
If I understand correctly, this issue surfaced because the call from Router A
reached the Namenode after the RetryCache expiry time, right? Had it reached
the Namenode before that, we would be safe, given the caller context changes?
If so, I think we should ensure that the Router to Namenode call times out
before the RetryCache expiry time, and that timeout should be configurable, I
feel. We should also ensure that if the client has dropped off, the Router to
Namenode connection is aborted as well.
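
To put numbers on it, here is a rough sketch in Java using only the figures
from the description quoted below: the worst-case connection setup time of
(60 * 3 + 45 * 20) * 5 seconds and the 10min RetryCache expiry. The class and
method names are hypothetical and not part of Hadoop; the sketch only does the
arithmetic and the comparison and reads no real configuration.

```java
import java.time.Duration;

/** Hypothetical helper (not a Hadoop class): compares connection setup time with the RetryCache window. */
public class RetryCacheWindowCheck {

    /** Worst-case connection setup time quoted in the issue: (60 * 3 + 45 * 20) * 5 seconds. */
    public static Duration worstCaseConnectionSetup() {
        long seconds = (60L * 3 + 45L * 20) * 5;
        return Duration.ofSeconds(seconds);
    }

    /** Returns true only if the setup time is strictly shorter than the RetryCache expiry. */
    public static boolean fitsInRetryCache(Duration setup, Duration retryCacheExpiry) {
        return setup.compareTo(retryCacheExpiry) < 0;
    }

    public static void main(String[] args) {
        // 10min is the RetryCache expiry given in the issue description.
        Duration retryCacheExpiry = Duration.ofMinutes(10);
        Duration worst = worstCaseConnectionSetup();

        System.out.println("worst-case setup: " + worst.getSeconds() + "s");
        System.out.println("covered by RetryCache: " + fitsInRetryCache(worst, retryCacheExpiry));
    }
}
```

With these figures the worst case comes to 5400 seconds (90 minutes), so the
late `create` from Router A falls well outside the RetryCache window.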
> RBF: Data loss when Router setup connection timeout
> ---------------------------------------------------
>
> Key: HDFS-15885
> URL: https://issues.apache.org/jira/browse/HDFS-15885
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Reporter: Xiaoqiao He
> Priority: Critical
>
> I recently hit a corner case that can lose data; it is very similar
> to HDFS-15079.
> Consider the following case:
> A. The client first sends a `create` RPC request to Router A. Router A tries
> to set up a new connection to the NameNode for this request, but the
> connection is not established in time.
> B. The client fails over to Router B because the request times out (60s by
> default, IIRC).
> C. Router B runs normally (including the `create` and `complete` RPCs) and
> returns to the client.
> D. After a while (more than 10min), Router A is working again and sends
> `create` to the NameNode again; the file is overwritten and the data is lost.
> I should note that in my deployment we have replaced the ClientId and CallId
> of the RPC at the Router side with the Client's ids, rather than using ones
> generated by the Router.
> After digging deeper, we found that setting up a connection can take a very
> long time when there are network issues. In the worst case, it takes
> (60 * 3 + 45 * 20) * 5 seconds (far greater than 10min, the RetryCache expiry
> time) to set up a connection, which depends on `maxRetriesOnSocketTimeouts`,
> `connectionTimeout`, `maxRetriesOnSasl` and `rpcTimeout`. In this case, the
> call is not covered by the `RetryCache` (10min by default) at the NameNode
> side.
> IMO, we should offer basic configuration suggestions for the Router
> (especially for the RPC layer) to avoid this data loss case again.