bharatviswa504 opened a new pull request #2247: URL: https://github.com/apache/ozone/pull/2247
## What changes were proposed in this pull request?

In OM, the SCM client is shared across RpcHandler threads. We have observed that failover triggered from multiple threads causes failover to happen incorrectly for the same SCM, which exhausts the retry count.

In the log below, the error is "no route to host" for scm3, but the retry is reported against scm1/172.31.0.9:9863:

```
2021-05-11 05:59:53,202 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm1,nodeAddress=scm1/172.31.0.9:9863 after 9 failover attempts. Trying to failover after sleeping for 2000ms.
```

The next attempt shows the same error for scm3, this time reported against scm2/172.31.0.6:9863:

```
2021-05-11 05:59:59,345 [IPC Server handler 10 on default port 9862] WARN ipc.Client: Address change detected. Old: scm3/172.31.0.5:9863 New: scm3:9863
2021-05-11 05:59:59,347 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm2,nodeAddress=scm2/172.31.0.6:9863 after 10 failover attempts. Trying to failover after sleeping for 2000ms.
```

This is because our performFailOver is a no-op, and when a failover is needed we update currentSCMProxyNodeID in shouldRetry of the RetryPolicy.
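As a rough illustration of why the placement of this update matters when the client is shared, here is a toy model with no Hadoop or Ozone dependencies. Every name in it (`FailoverSketch`, `advanceOnEveryRetry`, `failoverOnce`) is hypothetical and is not taken from the Ozone code base. Variant A advances the shared node pointer on every failed call, which is what happens when the update lives in the per-call retry decision. Variant B advances only if no failover has happened since the call started, which is the kind of guard a dedicated failover step can apply.

```java
import java.util.List;

// Toy model of a shared failover pointer. All names are hypothetical, not Ozone classes.
final class FailoverSketch {
    private final List<String> nodes = List.of("scm1", "scm2", "scm3");
    private int current = 2;         // start on scm3, the unreachable node in the logs
    private long failoverCount = 0;  // number of failovers performed so far

    synchronized String currentNode() {
        return nodes.get(current);
    }

    synchronized long failoverCount() {
        return failoverCount;
    }

    // Variant A: every failed caller advances the shared pointer.
    synchronized void advanceOnEveryRetry() {
        current = (current + 1) % nodes.size();
    }

    // Variant B: advance only if nobody has failed over since this call started.
    synchronized boolean failoverOnce(long countSeenAtCallStart) {
        if (failoverCount != countSeenAtCallStart) {
            return false; // another caller already failed over; reuse the current node
        }
        current = (current + 1) % nodes.size();
        failoverCount++;
        return true;
    }

    public static void main(String[] args) {
        // Two callers both failed against scm3 before either of them reacted.
        FailoverSketch a = new FailoverSketch();
        a.advanceOnEveryRetry(); // caller 1: scm3 -> scm1
        a.advanceOnEveryRetry(); // caller 2: scm1 -> scm2, although scm1 was never tried
        System.out.println("variant A ends on " + a.currentNode());

        FailoverSketch b = new FailoverSketch();
        long seen = b.failoverCount(); // both calls started before any failover
        b.failoverOnce(seen);          // caller 1: scm3 -> scm1
        b.failoverOnce(seen);          // caller 2: skipped, count has changed
        System.out.println("variant B ends on " + b.currentNode());
    }
}
```

In this sketch variant A skips scm1 without ever contacting it, while variant B leaves both callers pointed at scm1. The real client also has retry limits and sleep intervals, which the model leaves out.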
**For example**, two threads contact SCM3 and both get NoRouteToHostException. shouldRetry from the first thread moves currentSCMProxyNodeID to scm1, and the other thread then moves currentSCMProxyNodeID on to scm2.

Hadoop's proxy RetryInvocationHandler already handles the case where another thread is performing a failover: it will not call performFailOver again. Instead we see the WARN message below, and the call gets the current proxy and contacts that node.

**om3_1 | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #24329 $Proxy32.send over nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863**

The solution here is to use performFailOver to update scmNodeID, instead of using shouldRetry to update currentSCMProxyNodeID.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5216

## How was this patch tested?

Tested locally and observed that it no longer performs failOver again and no longer exhausts the retry count.
