greenwich opened a new pull request, #9546: URL: https://github.com/apache/ozone/pull/9546
## What changes were proposed in this pull request?

This PR makes `OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY` in S3g apply per request.

1. In `hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java`, line 93, we have `private int failoverCount = 0;`. All threads share this counter, and it never resets.
2. In `GrpcOmTransport.shouldRetry` (258) we run `action = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);`, so the increment is also shared between requests.
3. In `OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction`, that global `failoverCount` is checked with `if (failovers < maxFailovers)` (258). Once the counter reaches `maxFailovers`, this always leads to `return RetryAction.FAIL;` (263).
4. `maxFailovers` in the check above is defined by `OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY`.

I propose tracking `failoverCount` per request instead of keeping it as a global counter.

Detailed explanation and discussion are here: https://github.com/apache/ozone/discussions/9477

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14212

## How was this patch tested?

Branch pipeline: https://github.com/greenwich/ozone/actions/runs/20430980506

- Added a unit test to check that `failoverCount` is set for each request.
- Before fixing the bug, I created a new integration test that mimics concurrent user requests. It reproduces the issue, and I then used it to verify the fix.

Concurrent test results before the fix, demonstrating the bug (failover to om2 never happened):

```
--- Failed Requests (Failover Attempts) ---
om0: 500 failures (10.0 %)
om1: 4510 failures (90.0 %)
om2: 0 failures (0.0 %) NEVER TRIED!

--- Successful Requests ---
om0: 5 successes (100.0 %) LEADER
om1: 0 successes (0.0 %)
om2: 0 successes (0.0 %)
```

Concurrent test results after the fix, demonstrating the correct behaviour (failover to om2):

```
--- Failed Requests (Failover Attempts) ---
om0: 500 failures (97.1 %)
om1: 15 failures (2.9 %)
om2: 0 failures (0.0 %) NEVER TRIED!

--- Successful Requests ---
om0: 5 successes (0.1 %) LEADER
om1: 0 successes (0.0 %)
om2: 5000 successes (99.9 %) LEADER
```
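
To illustrate the difference between a shared and a per-request failover counter, here is a minimal, single-threaded sketch. It is not the Ozone code and not the actual patch: the class names `FailoverBudgetSketch`, `SharedCounterTransport`, and `PerRequestCounterTransport`, and the helper `mayFailover`, are hypothetical. Only the `failovers < maxFailovers` comparison and the `failoverCount++` increment are taken from the description above. The sketch assumes every request needs two failovers to reach the leader and that the cap is 3.

```java
/**
 * Hypothetical sketch only; these classes are NOT the Ozone implementation.
 * They model how a failover cap behaves with a shared vs. a per-request counter.
 */
class FailoverBudgetSketch {

  /** Mirrors the quoted check: a failover is allowed while under the cap. */
  static boolean mayFailover(int failovers, int maxFailovers) {
    return failovers < maxFailovers;
  }

  /** One counter for the transport's whole lifetime (the behaviour reported as the bug). */
  static final class SharedCounterTransport {
    private final int maxFailovers;
    private int failoverCount = 0; // shared by every request, never reset

    SharedCounterTransport(int maxFailovers) {
      this.maxFailovers = maxFailovers;
    }

    /** Returns true if the request was granted every failover it needed. */
    boolean submit(int failoversNeeded) {
      for (int i = 0; i < failoversNeeded; i++) {
        if (!mayFailover(failoverCount++, maxFailovers)) {
          return false; // cap already consumed, possibly by earlier requests
        }
      }
      return true;
    }
  }

  /** A fresh counter for each request (the behaviour the PR proposes). */
  static final class PerRequestCounterTransport {
    private final int maxFailovers;

    PerRequestCounterTransport(int maxFailovers) {
      this.maxFailovers = maxFailovers;
    }

    /** Returns true if the request was granted every failover it needed. */
    boolean submit(int failoversNeeded) {
      int failoverCount = 0; // local to this request
      for (int i = 0; i < failoversNeeded; i++) {
        if (!mayFailover(failoverCount++, maxFailovers)) {
          return false;
        }
      }
      return true;
    }
  }

  public static void main(String[] args) {
    SharedCounterTransport shared = new SharedCounterTransport(3);
    PerRequestCounterTransport perRequest = new PerRequestCounterTransport(3);

    // Three identical requests, each needing two failovers (assumed: om0 -> om1 -> om2).
    System.out.println("shared counter:      "
        + shared.submit(2) + " " + shared.submit(2) + " " + shared.submit(2));
    // prints "shared counter:      true false false"
    System.out.println("per-request counter: "
        + perRequest.submit(2) + " " + perRequest.submit(2) + " " + perRequest.submit(2));
    // prints "per-request counter: true true true"
  }
}
```

With the shared counter, the first request uses two of the three allowed failovers, so every later request is refused before it can reach the third node. With a per-request counter, each request gets the full budget. The sketch deliberately leaves out concurrency; in the real transport the shared field is also incremented from multiple threads.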
