greenwich opened a new pull request, #9546:
URL: https://github.com/apache/ozone/pull/9546

   ## What changes were proposed in this pull request?
   This PR makes the failover limit set by `OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY` apply per request in S3g, rather than across all requests.
   
   1. In `hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java` (line 93) we have `private int failoverCount = 0;`. All threads share this counter, and it is never reset.
   2. In `GrpcOmTransport.shouldRetry` (line 258) we run `action = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);`, so the incremented counter is also shared between requests.
   3. In `OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction`, that shared `failoverCount` is checked with `if (failovers < maxFailovers)` (line 258). Once the counter reaches `maxFailovers`, the method always returns `RetryAction.FAIL` (line 263).
   4. `maxFailovers` is set by `OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY`.
   
   I propose tracking `failoverCount` per request, rather than keeping it as a counter shared by all requests.
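
   The difference between the two approaches can be illustrated with a minimal, self-contained sketch. It is not the Ozone code: the class names, `MAX_FAILOVERS`, and the simplified `shouldFailover` check are hypothetical stand-ins for the `failovers < maxFailovers` comparison described above.

   ```java
   public class FailoverCountSketch {
     // Hypothetical stand-in for the value configured through
     // OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY.
     static final int MAX_FAILOVERS = 3;

     // Simplified stand-in for the retry policy check: failover is allowed
     // only while the counter is below the limit.
     static boolean shouldFailover(int failovers) {
       return failovers < MAX_FAILOVERS;
     }

     // Models the reported bug: the counter is a field, so it survives
     // across requests and is never reset.
     static class SharedCounterTransport {
       private int failoverCount = 0;

       int failoversAllowedForRequest() {
         int allowed = 0;
         while (shouldFailover(failoverCount++)) {
           allowed++;
         }
         return allowed;
       }
     }

     // Models the proposed behaviour: each request starts from zero.
     static class PerRequestTransport {
       int failoversAllowedForRequest() {
         int failoverCount = 0;
         int allowed = 0;
         while (shouldFailover(failoverCount++)) {
           allowed++;
         }
         return allowed;
       }
     }

     public static void main(String[] args) {
       SharedCounterTransport shared = new SharedCounterTransport();
       PerRequestTransport perRequest = new PerRequestTransport();
       for (int request = 1; request <= 3; request++) {
         System.out.println("request " + request
             + ": shared=" + shared.failoversAllowedForRequest()
             + " perRequest=" + perRequest.failoversAllowedForRequest());
       }
     }
   }
   ```

   In this sketch the first request uses up the whole budget of the shared counter, so every later request is refused a failover immediately, while the per-request variant grants each request the full budget.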
   
   Detailed explanation and discussion are here: 
https://github.com/apache/ozone/discussions/9477
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-14212
   
   ## How was this patch tested?
   Branch pipeline: https://github.com/greenwich/ozone/actions/runs/20430980506
   - Added a unit test to check that `failoverCount` is set for each request.
   - Before fixing the bug, I created a new integration test that mimics concurrent user requests. It reproduces the issue, and I then used it to verify the fix.
   Concurrent test results before the fix, demonstrating the bug: failover never reached om2
   ```
   --- Failed Requests (Failover Attempts) ---
   om0: 500 failures (10.0 %)
   om1: 4510 failures (90.0 %)
   om2: 0 failures (0.0 %) NEVER TRIED!
   
   --- Successful Requests ---
   om0: 5 successes (100.0 %) LEADER
   om1: 0 successes (0.0 %)
   om2: 0 successes (0.0 %)
   ```
   
   Concurrent test results after the fix, demonstrating the correct behaviour: failover to om2
   ```
   --- Failed Requests (Failover Attempts) ---
   om0: 500 failures (97.1 %)
   om1: 15 failures (2.9 %)
   om2: 0 failures (0.0 %) NEVER TRIED!
   --- Successful Requests ---
   om0: 5 successes (0.1 %) LEADER
   om1: 0 successes (0.0 %)
   om2: 5000 successes (99.9 %) LEADER
   ```
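
   For readers who want to reproduce the idea without a cluster, the following toy model approximates the concurrent scenario. It is only a sketch and not the integration test from this PR: the three-node walk, `MAX_FAILOVERS`, `LEADER_INDEX`, and all method names are assumptions made for illustration, and it uses `AtomicInteger` (the original field is a plain `int`) so that the model itself is thread-safe.

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;
   import java.util.function.Supplier;

   public class ConcurrentFailoverSketch {
     static final int MAX_FAILOVERS = 4; // hypothetical failover limit
     static final int LEADER_INDEX = 2;  // in this model only om2 answers

     // One request walks om0 -> om1 -> om2 and consults the counter before
     // every failover. It fails as soon as the budget is exhausted.
     static boolean sendRequest(AtomicInteger failoverCount) {
       int node = 0;
       while (node != LEADER_INDEX) {
         if (failoverCount.getAndIncrement() >= MAX_FAILOVERS) {
           return false;
         }
         node++;
       }
       return true;
     }

     // Runs the requests on a thread pool and returns how many succeeded.
     // The supplier decides whether requests share a counter or get their own.
     static int run(int requests, Supplier<AtomicInteger> counterForRequest) {
       ExecutorService pool = Executors.newFixedThreadPool(8);
       AtomicInteger successes = new AtomicInteger();
       for (int i = 0; i < requests; i++) {
         pool.submit(() -> {
           if (sendRequest(counterForRequest.get())) {
             successes.incrementAndGet();
           }
         });
       }
       pool.shutdown();
       try {
         if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
           throw new IllegalStateException("requests did not finish in time");
         }
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         throw new IllegalStateException(e);
       }
       return successes.get();
     }

     public static void main(String[] args) {
       AtomicInteger shared = new AtomicInteger();
       int sharedOk = run(1000, () -> shared);           // one counter for all
       int perRequestOk = run(1000, AtomicInteger::new); // fresh counter each
       System.out.println("shared counter successes: " + sharedOk);
       System.out.println("per-request counter successes: " + perRequestOk);
     }
   }
   ```

   With a shared counter, at most `MAX_FAILOVERS / 2` requests can reach the leader in this model, because each success needs two failovers from a budget that is never replenished. With a fresh counter per request, every request reaches om2.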
   

