[GitHub] [ozone] bharatviswa504 opened a new pull request #2247: Hdds 5216: Fix race condition causing failOverProxy which is causing failover wrongly.

GitBox Thu, 13 May 2021 22:08:21 -0700


bharatviswa504 opened a new pull request #2247:
URL: https://github.com/apache/ozone/pull/2247



   ## What changes were proposed in this pull request?
   
   In OM, the SCM client is shared across RpcHandler threads.
   We have observed that failover triggered from multiple threads causes
   failover to happen incorrectly for the same SCM failure, which exhausts
   the retry count.
   
   Another observation: in the log below, the error is no route to scm3, but
   the retry happened on scm1/172.31.0.9:9863.
   ```
   2021-05-11 05:59:53,202 [IPC Server handler 10 on default port 9862] INFO 
retry.RetryInvocationHandler: com.google.protobuf.ServiceException: 
java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to 
scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: 
No route to host; For more details see:  
http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over 
nodeId=scm1,nodeAddress=scm1/172.31.0.9:9863 after 9 failover attempts. Trying 
to failover after sleeping for 2000ms.
   ```
   Similarly, in the next log the error is again no route to scm3, but the
   retry happened on scm2/172.31.0.6:9863.
   ```
   2021-05-11 05:59:59,345 [IPC Server handler 10 on default port 9862] WARN 
ipc.Client: Address change detected. Old: scm3/172.31.0.5:9863 New: scm3:9863
   2021-05-11 05:59:59,347 [IPC Server handler 10 on default port 9862] INFO 
retry.RetryInvocationHandler: com.google.protobuf.ServiceException: 
java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to 
scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: 
No route to host; For more details see:  
http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over 
nodeId=scm2,nodeAddress=scm2/172.31.0.6:9863 after 10 failover attempts. Trying 
to failover after sleeping for 2000ms.
   ```
   This happens because our performFailOver is a no-op; when failover is
   needed, we update currentSCMProxyNodeID in shouldRetry of the RetryPolicy
   instead.
   
   **For example**
   Two threads contact scm3 and both get NoRouteToHostException. shouldRetry
   in the first thread moves currentSCMProxyNodeID to scm1, and shouldRetry in
   the second thread then moves it again, to scm2.
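
The interleaving in this example can be replayed with a small, self-contained sketch. It is not the Ozone code: `UnguardedFailoverSketch`, `Selector`, and `onFailure` are hypothetical names, and the two threads are replayed sequentially so the outcome is deterministic. It only illustrates what happens when every failed call moves the shared node pointer.

```java
import java.util.List;

class UnguardedFailoverSketch {
    static final List<String> NODES = List.of("scm1", "scm2", "scm3");

    // Hypothetical stand-in for the node pointer shared by all handler threads.
    static class Selector {
        private int index = 2;      // currently pointing at scm3
        private int failovers = 0;  // how many times the pointer has moved

        // Models the retry-policy path: every failed call moves the pointer.
        synchronized void onFailure() {
            index = (index + 1) % NODES.size();
            failovers++;
        }

        synchronized String current() { return NODES.get(index); }

        synchronized int failovers() { return failovers; }
    }

    public static void main(String[] args) {
        Selector shared = new Selector();
        // Both calls were sent to scm3 and both failed there.
        shared.onFailure(); // thread 1: scm3 -> scm1
        shared.onFailure(); // thread 2: scm1 -> scm2, although the failure was on scm3
        // prints "scm2 after 2 failovers"
        System.out.println(shared.current() + " after " + shared.failovers() + " failovers");
    }
}
```

A single scm3 failure seen by two threads costs two failovers here, which is how the retry budget can be used up quickly.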
   
   Hadoop's RetryInvocationHandler already handles the case where another
   thread has performed a failover: it does not call performFailOver again.
   Instead, we see the WARN message below, and the call gets the current proxy
   and contacts that node.
   
   **om3_1       | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default 
port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the 
start of call #24329 $Proxy32.send over 
nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863**
   
   The solution is to use performFailOver to update scmNodeID instead of using
   shouldRetry to update currentSCMProxyNodeID.
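
A minimal sketch of why moving the update helps, assuming the handler compares a failover counter captured at the start of each call with the current one, as the WARN message above suggests. `GuardedFailoverSketch`, `Selector`, and `failover` are hypothetical names for illustration; this is neither the Hadoop nor the Ozone implementation.

```java
import java.util.List;

class GuardedFailoverSketch {
    static final List<String> NODES = List.of("scm1", "scm2", "scm3");

    // Hypothetical stand-in for the handler plus proxy provider pair.
    static class Selector {
        private int index = 2;           // currently pointing at scm3
        private long failoverCount = 0;  // completed failovers

        synchronized long failoverCount() { return failoverCount; }

        // Moves the pointer only if nobody has failed over since the caller
        // captured expectedCount; otherwise the caller reuses the current node.
        synchronized boolean failover(long expectedCount) {
            if (failoverCount != expectedCount) {
                return false; // a failover has occurred since the start of the call
            }
            index = (index + 1) % NODES.size();
            failoverCount++;
            return true;
        }

        synchronized String current() { return NODES.get(index); }
    }

    public static void main(String[] args) {
        Selector shared = new Selector();
        long seenByThread1 = shared.failoverCount(); // both calls start against scm3
        long seenByThread2 = shared.failoverCount();
        shared.failover(seenByThread1); // scm3 -> scm1
        shared.failover(seenByThread2); // stale count: no second move
        // prints "scm1 after 1 failover(s)"
        System.out.println(shared.current() + " after " + shared.failoverCount() + " failover(s)");
    }
}
```

With the node update inside the guarded step, the second thread retries against scm1 rather than pushing the pointer on to scm2.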
   
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-5216
   
   ## How was this patch tested?
   Tested locally; failover is no longer performed repeatedly for the same
   failure, and the retry count is no longer exhausted.


