Aleksei Ieshin created HDDS-14212:
-------------------------------------
Summary: S3G stuck on failover to a new leader OM
Key: HDDS-14212
URL: https://issues.apache.org/jira/browse/HDDS-14212
Project: Apache Ozone
Issue Type: Bug
Components: S3, s3gateway
Reporter: Aleksei Ieshin
Our Ozone cluster runs on Kubernetes. There are several kube nodes, and each
node runs one S3G and one DN. Some kube nodes additionally run one OM or one
SCM instance. We have three OMs: om0, om1, and om2.
For an unknown reason, the kube node running an S3G, a DN, and *om1* (the
leader) went into a non-Ready state for a few minutes (om1 was still running
but did not serve any traffic). That caused *om2* to take over leadership. A
few seconds later, *om1* returned to the cluster.
All S3Gs failed over to the new OM leader except one, which got stuck
retrying failover. Restarting that S3G resolved the issue.
h4. Investigation
Later, the investigation showed the following:
# The cluster had very low (non-default) settings that made it exhaust its
failover limit quickly:
{{"ozone.client.wait.between.retries.millis": "250"
"ozone.client.failover.max.attempts": "16"}}
# In
{{hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java}},
line 93: {{private int failoverCount = 0;}} - All threads share this counter,
and it never resets.
* Also, in {{GrpcOmTransport.shouldRetry}} (line 258) we run {{action =
retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);}} Is
incrementing the shared counter here intentional, and is it safe?
* Next, in {{OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction}},
that shared {{failoverCount}} is checked with {{if (failovers <
maxFailovers)}} (line 258), so the method always reaches {{return
RetryAction.FAIL;}} (line 263) once {{maxFailovers}} has been reached.
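
The exhaustion described in the list above can be reproduced with a minimal sketch. {{SharedFailoverCounterDemo}}, {{onError}}, and {{retryAction}} are hypothetical stand-ins, not the actual {{GrpcOmTransport}} or {{OMFailoverProxyProviderBase}} code. Only the shared, never-reset counter and the {{failovers < maxFailovers}} check are taken from the report.

```java
/** Hypothetical stand-in for the transport; NOT the real GrpcOmTransport. */
public class SharedFailoverCounterDemo {
    enum Action { FAILOVER_AND_RETRY, FAIL }

    // Value of ozone.client.failover.max.attempts from the report.
    static final int MAX_FAILOVERS = 16;

    // Mirrors the reported pattern: one counter per transport instance,
    // shared by every request and never reset.
    private int failoverCount = 0;

    // Simplified form of the reported check: retry only while
    // failovers < maxFailovers, otherwise fail.
    static Action retryAction(int failovers, int maxFailovers) {
        return failovers < maxFailovers ? Action.FAILOVER_AND_RETRY : Action.FAIL;
    }

    // Called whenever any request hits an error that would trigger a failover.
    Action onError() {
        return retryAction(failoverCount++, MAX_FAILOVERS);
    }

    public static void main(String[] args) {
        SharedFailoverCounterDemo transport = new SharedFailoverCounterDemo();
        // During the outage, requests use up the whole failover budget.
        for (int i = 0; i < MAX_FAILOVERS; i++) {
            transport.onError();
        }
        // After the outage, the first error of a brand-new request
        // is rejected immediately because the counter was never reset.
        System.out.println(transport.onError()); // prints FAIL
    }
}
```

If the configured 250 ms wait is applied once per failover (an assumption about how the two settings interact), 16 attempts add up to roughly 4 seconds, which is far shorter than the few minutes om1 was unavailable.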
Proposal: track {{failoverCount}} per request instead of keeping it as a
counter shared by all requests?
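
A minimal sketch of what per-request counting could look like. {{PerRequestFailoverDemo}} and {{submit}} are hypothetical names used for illustration only, and the actual change in {{GrpcOmTransport}} may look quite different.

```java
import java.util.function.IntPredicate;

/** Hypothetical illustration of a per-request failover budget. */
public class PerRequestFailoverDemo {
    // Value of ozone.client.failover.max.attempts from the report.
    static final int MAX_FAILOVERS = 16;

    /**
     * Runs one request. attemptSucceeds receives the number of failovers
     * already performed for THIS request and says whether the attempt succeeds.
     * Returns the number of failovers used, or -1 if the budget ran out.
     */
    static int submit(IntPredicate attemptSucceeds) {
        int failoverCount = 0; // local variable: each request starts from zero
        while (true) {
            if (attemptSucceeds.test(failoverCount)) {
                return failoverCount;
            }
            if (failoverCount >= MAX_FAILOVERS) {
                return -1; // this request gives up; later requests are unaffected
            }
            failoverCount++;
        }
    }

    public static void main(String[] args) {
        // During the outage: no attempt succeeds, so the request fails.
        System.out.println(submit(n -> false)); // prints -1
        // After the outage: a new request succeeds after a single failover.
        System.out.println(submit(n -> n >= 1)); // prints 1
    }
}
```

Because the counter lives on the stack of each call, a request that exhausts its budget during an outage cannot affect requests submitted after the new leader is reachable.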
More details here: [https://github.com/apache/ozone/discussions/9477]