[ 
https://issues.apache.org/jira/browse/HDDS-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346759#comment-17346759
 ] 

Bharat Viswanadham edited comment on HDDS-5228 at 5/18/21, 10:04 AM:
---------------------------------------------------------------------

{quote}Even if RpcClient is shared across threads, they all will have the same 
FailoverProxyProvider. If the 1st thread fails over and discovers the leader 
OM, all the subsequent requests (from any thread) will be directed to the 
correct OM. I do not see how the retry count will be exhausted because of 
shared threads. Please let me know if I am missing something here.{quote}

The problem here is that we update currentProxyNodeId in RetryPolicy#shouldRetry.

So, let's say two threads are both contacting OM1, and OM1 is down.

T1 updates the proxy to OM2 in RetryPolicy#shouldRetry and records it in the 
proxyDescriptor.
T2 updates the proxy to OM3 in RetryPolicy#shouldRetry and records it in the 
proxyDescriptor.

So if T1 and T2 are running in parallel, once T1 has updated the proxy, T2 
should not update it again.
If yet another thread runs, it will update the proxy back around to OM1, and we 
will be contacting the same (down) OM again, which can exhaust the retry count.
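The race above can be sketched as follows. This is a minimal illustration, not the actual Ozone code: the class and method names (UnguardedFailoverSketch, shouldRetryAndFailover) are invented for the example; only the idea that every retrying thread unconditionally advances a shared proxy index matches the behavior described above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: each thread that sees a failure advances the
// shared proxy index in shouldRetry, with no guard against another
// thread having already failed over.
public class UnguardedFailoverSketch {
    static final String[] OMS = {"OM1", "OM2", "OM3"};
    // Shared across threads, like currentProxyNodeId in the provider.
    static final AtomicInteger currentProxyIndex = new AtomicInteger(0);

    // Models RetryPolicy#shouldRetry advancing the proxy on every failure.
    static String shouldRetryAndFailover() {
        int next = currentProxyIndex.incrementAndGet() % OMS.length;
        return OMS[next];
    }

    public static void main(String[] args) {
        // T1 and T2 both saw OM1 fail; each advances the shared index,
        // and a third failure wraps back to the down OM1.
        System.out.println(shouldRetryAndFailover()); // OM2
        System.out.println(shouldRetryAndFailover()); // OM3
        System.out.println(shouldRetryAndFailover()); // OM1 again
    }
}
```

Three failures on a 3-node cluster land back on the down node, so retries are burned without ever settling on a healthy OM.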


RetryInvocationHandler handles this case by comparing the expected 
failoverCount and skipping the call to performFailover when another thread has 
already failed over. But our performFailover is a no-op and currentProxyNodeId 
is updated in shouldRetry instead, which is the root cause of the problem.
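The failover-count guard can be sketched like this. Again a hedged, illustrative sketch, not the RetryInvocationHandler source: the names (GuardedFailoverSketch, failoverIfNeeded) are invented; the point is that a thread only performs the failover if no other thread has done one since it captured the count.

```java
// Hypothetical sketch of the guard: late threads with a stale
// failover count reuse the already-updated proxy instead of
// advancing it again.
public class GuardedFailoverSketch {
    private final String[] oms = {"OM1", "OM2", "OM3"};
    private int currentIndex = 0;
    private long failoverCount = 0;

    public synchronized long getFailoverCount() { return failoverCount; }

    // Only the first thread whose expectedCount still matches performs
    // the failover; everyone else just gets the current proxy back.
    public synchronized String failoverIfNeeded(long expectedCount) {
        if (failoverCount == expectedCount) {
            currentIndex = (currentIndex + 1) % oms.length;
            failoverCount++;
        }
        return oms[currentIndex];
    }

    public static void main(String[] args) {
        GuardedFailoverSketch p = new GuardedFailoverSketch();
        // Both T1 and T2 saw OM1 fail and captured the same count.
        long seenByT1 = p.getFailoverCount();
        long seenByT2 = p.getFailoverCount();
        System.out.println(p.failoverIfNeeded(seenByT1)); // T1 fails over: OM2
        System.out.println(p.failoverIfNeeded(seenByT2)); // T2 is stale: still OM2
    }
}
```

With the guard, the two concurrent failures cost one failover instead of two, so the retry budget is spent walking the OM list at most once per actual outage.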

This was observed with SCM, as the SCM client is shared across the OM RPC 
handler threads (we create a single SCM client in OM).
We recently fixed this for SCM; for more info, refer to 
https://github.com/apache/ozone/pull/2249#issue-645725169



> Make OM FailOverProxyProvider work across threads
> -------------------------------------------------
>
>                 Key: HDDS-5228
>                 URL: https://issues.apache.org/jira/browse/HDDS-5228
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>
> Use performFailover to do the actual failover instead of updating the proxy in 
> RetryPolicy#shouldRetry.
> Otherwise, if the RpcClient is shared across threads, it will unnecessarily 
> exhaust the retry count. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
