adoroszlai opened a new pull request, #5476: URL: https://github.com/apache/ozone/pull/5476
## What changes were proposed in this pull request?

Fix intermittent error in `testAllVolumeOperations`:

`OMNotLeaderException: OM:omNode-1 is not the leader. Suggested leader is OM:omNode-2[localhost/127.0.0.1].`

`TestOzoneManagerHA` subclasses use a single cluster for all tests, and all OM instances are restarted after each test case.

The patch makes 3 main changes:

1. Add "wait for OM leader election" before each test case.
2. Mark all OMs as active when restarting them. (The HA mini cluster keeps track of active and inactive OMs. An OM stopped via `stopOzoneManager()` is marked as inactive. Before this change, `restartOzoneManager()` still started all OMs, even inactive ones, but `getLeaderOM()` only considers active ones, so the actual leader might not be found if it was left as "inactive".)
3. Wait for the OM RPC server to really stop (call `join()`) when restarting an OM. Avoid calling `join()` if the OM is already stopped, as that would wait for `notifyAll()` without anyone signalling.

https://issues.apache.org/jira/browse/HDDS-9429

## How was this patch tested?

`TestOzoneManagerHAMetadataOnly` passed in 300 runs:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/6602838480

On `master` it failed in 17/300 runs:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/6602302723/job/17935202601
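
For illustration only, the bookkeeping problem behind change 2 can be reproduced with a toy model. This is a minimal sketch, not the mini cluster's code: `ToyHaCluster` and all of its methods are hypothetical names, and the real `restartOzoneManager()` and `getLeaderOM()` may be structured quite differently. It only assumes what the description states: stopped nodes are tracked as inactive, and the leader lookup searches active nodes only.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Toy model of active/inactive node bookkeeping; hypothetical, not Ozone code. */
public class ToyHaCluster {
    private final Set<String> active = new LinkedHashSet<>();
    private final Set<String> inactive = new LinkedHashSet<>();
    private String leaderId; // node that currently holds leadership, if any

    public ToyHaCluster(String... nodeIds) {
        for (String id : nodeIds) {
            active.add(id);
        }
    }

    /** Simulates the outcome of a leader election. */
    public void setLeader(String id) {
        leaderId = id;
    }

    /** Stopping a node moves it from the active set to the inactive set. */
    public void stop(String id) {
        if (active.remove(id)) {
            inactive.add(id);
        }
    }

    /** Behaviour before the fix: nodes run again, but the sets are untouched. */
    public void restartAllWithoutReactivating() {
        // Intentionally empty: the bookkeeping is left stale.
    }

    /** Behaviour after the fix: every restarted node is marked active again. */
    public void restartAllAndReactivate() {
        active.addAll(inactive);
        inactive.clear();
    }

    /** Leader lookup that only considers nodes in the active set. */
    public Optional<String> findLeader() {
        return active.contains(leaderId) ? Optional.of(leaderId) : Optional.empty();
    }

    public static void main(String[] args) {
        ToyHaCluster cluster = new ToyHaCluster("omNode-1", "omNode-2", "omNode-3");
        cluster.stop("omNode-2");
        cluster.restartAllWithoutReactivating();
        cluster.setLeader("omNode-2"); // the restarted node wins the election
        System.out.println("stale bookkeeping: " + cluster.findLeader());
        cluster.restartAllAndReactivate();
        System.out.println("after reactivation: " + cluster.findLeader());
    }
}
```

With stale bookkeeping the lookup comes back empty even though a leader exists, which is the situation in which a test could end up talking to a non-leader OM.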
