[
https://issues.apache.org/jira/browse/HDDS-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346759#comment-17346759
]
Bharat Viswanadham edited comment on HDDS-5228 at 5/18/21, 10:04 AM:
---------------------------------------------------------------------
{quote}Even if RpcClient is shared across threads, they all will have the same
FailoverProxyProvider. If the 1st thread fails over and discovers the leader
OM, all the subsequent requests (from any thread) will be directed to the
correct OM. I do not see how the retry count will be exhausted because of
shared threads. Please let me know if I am missing something here.{quote}
The problem here is that we update currentProxyNodeId in
RetryPolicy#shouldRetry.
So, let's say two threads are both contacting OM1, and OM1 is down.
T1 updates the proxy to OM2 in RetryPolicy#shouldRetry and updates the proxy in
proxyDescriptor.
T2 updates the proxy to OM3 in RetryPolicy#shouldRetry and updates the proxy in
proxyDescriptor.
So if T1 and T2 are running in parallel, T2 should not update the proxy again
once T1 has updated it.
If another thread is running, it will update the proxy again, back to OM1. We
will then be contacting the same OM again and can exhaust the retry count.
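To make the sequence above concrete, here is a minimal, single-threaded sketch
of the interleaving. It is not the Ozone code: SharedProvider, advanceOnRetry
and simulate are hypothetical names, and the three node IDs are assumed to be
tried in round-robin order, as in the OM1 -> OM2 -> OM3 example above.

```java
import java.util.List;

public class UnguardedFailoverSketch {

    // Hypothetical stand-in for a proxy provider shared by all client threads.
    static final class SharedProvider {
        private final List<String> nodeIds;
        private int currentIndex = 0;

        SharedProvider(List<String> nodeIds) {
            this.nodeIds = nodeIds;
        }

        synchronized String currentProxyNodeId() {
            return nodeIds.get(currentIndex);
        }

        // Mimics moving to the next node from inside shouldRetry: the caller
        // never says which node it saw fail, so every failed call rotates.
        synchronized void advanceOnRetry() {
            currentIndex = (currentIndex + 1) % nodeIds.size();
        }
    }

    // Replays the case where 'failedCallers' threads all saw om1 fail and
    // each one then runs its own retry decision, one after the other.
    static String simulate(int failedCallers) {
        SharedProvider provider =
            new SharedProvider(List.of("om1", "om2", "om3"));
        for (int i = 0; i < failedCallers; i++) {
            provider.advanceOnRetry();
        }
        return provider.currentProxyNodeId();
    }

    public static void main(String[] args) {
        // Three callers failed against om1; the shared pointer wraps around.
        System.out.println(simulate(3)); // prints om1
    }
}
```

With three callers, the shared pointer ends up on om1 again, even though om1
is the node that all of them just saw fail.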
RetryInvocationHandler handles this case by comparing the expected
failOverCount and not calling performFailOver, but our performFailOver is a
no-op and currentProxyNodeId is updated in shouldRetry. This is the root cause
of the problem.
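The failover-count comparison described above could look roughly like the
sketch below. This is only an illustration under assumptions: GuardedProvider,
failover and simulate are hypothetical names, not the Hadoop or Ozone API. The
idea is that each call records the failover count when it starts, and the
proxy is switched only if nobody else has failed over since then.

```java
import java.util.List;

public class GuardedFailoverSketch {

    // Hypothetical provider that switches nodes only for the first caller
    // reporting a failure against the current failover generation.
    static final class GuardedProvider {
        private final List<String> nodeIds;
        private int currentIndex = 0;
        private long failoverCount = 0;

        GuardedProvider(List<String> nodeIds) {
            this.nodeIds = nodeIds;
        }

        synchronized long failoverCount() {
            return failoverCount;
        }

        synchronized String currentProxyNodeId() {
            return nodeIds.get(currentIndex);
        }

        // Returns true if this call actually switched the proxy, false if
        // another caller already failed over since the count was read.
        synchronized boolean failover(long expectedFailoverCount) {
            if (failoverCount != expectedFailoverCount) {
                return false;
            }
            currentIndex = (currentIndex + 1) % nodeIds.size();
            failoverCount++;
            return true;
        }
    }

    // Replays 'failedCallers' threads that all started their call against
    // om1 (same observed count) and then all report the failure.
    static String simulate(int failedCallers) {
        GuardedProvider provider =
            new GuardedProvider(List.of("om1", "om2", "om3"));
        long observedCount = provider.failoverCount();
        for (int i = 0; i < failedCallers; i++) {
            provider.failover(observedCount);
        }
        return provider.currentProxyNodeId();
    }

    public static void main(String[] args) {
        // Only the first report moves the proxy; the rest are ignored.
        System.out.println(simulate(3)); // prints om2
    }
}
```

Here the second and third callers find that the count has already moved on, so
they simply retry against om2 instead of rotating the proxy further.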
This was observed in SCM, because the SCM client is shared across the OM RPC
handler threads: we create a single SCM client in OM.
We recently fixed this for SCM; for more info, refer to:
https://github.com/apache/ozone/pull/2249#issue-645725169
> Make OM FailOverProxyProvider work across threads
> -------------------------------------------------
>
> Key: HDDS-5228
> URL: https://issues.apache.org/jira/browse/HDDS-5228
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Bharat Viswanadham
> Assignee: Bharat Viswanadham
> Priority: Major
>
> Use performFailover to perform the failover instead of updating the proxy in
> RetryPolicy#shouldRetry.
> With the current behavior, if an RpcClient is shared across threads, it will
> unnecessarily exhaust the retry count.