adoroszlai opened a new pull request, #5476:
URL: https://github.com/apache/ozone/pull/5476

   ## What changes were proposed in this pull request?
   
   Fix intermittent error in `testAllVolumeOperations`: `OMNotLeaderException: OM:omNode-1 is not the leader. Suggested leader is OM:omNode-2[localhost/127.0.0.1].`
   
   `TestOzoneManagerHA` subclasses use a single cluster for all tests, and all 
OM instances are restarted after each test case.
   
   The patch makes 3 main changes:
   
   1. add "wait for OM leader election" before each test case
   2. mark all OMs as active when restarting them. The HA mini cluster keeps track of active and inactive OMs, and an OM stopped via `stopOzoneManager()` is marked as inactive. Before this change, `restartOzoneManager()` started all OMs, even inactive ones, but left them marked as inactive. Since `getLeaderOM()` only considers active OMs, it could fail to find the actual leader if that OM was still marked as inactive.
   3. wait for the OM RPC server to actually stop (call `join()`) when restarting an OM. Skip `join()` if the OM is already stopped, since it would wait for a `notifyAll()` that nobody would send.
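
To illustrate changes 1 and 2, here is a minimal, self-contained sketch of the bookkeeping problem. It is not the Ozone mini cluster code: `MiniClusterSketch`, `Node`, `stop`, `restartAll`, `findLeader` and `waitFor` are hypothetical names invented for this example, and leader election is simulated by setting a flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

/** Hypothetical model of a cluster that tracks active and inactive nodes. */
final class MiniClusterSketch {

  /** Stand-in for an OM instance; the leader flag simulates election. */
  static final class Node {
    final String id;
    boolean running = true;
    boolean leader;

    Node(String id) {
      this.id = id;
    }
  }

  private final List<Node> active = new ArrayList<>();
  private final List<Node> inactive = new ArrayList<>();

  MiniClusterSketch(List<Node> nodes) {
    active.addAll(nodes);
  }

  /** Stops a node and moves it from the active list to the inactive list. */
  void stop(Node node) {
    node.running = false;
    node.leader = false;
    if (active.remove(node)) {
      inactive.add(node);
    }
  }

  /**
   * Starts every node. With markActive == false the inactive list is left
   * untouched (the behaviour described as "before this change"); with
   * markActive == true restarted nodes become visible to findLeader().
   */
  void restartAll(boolean markActive) {
    for (Node node : active) {
      node.running = true;
    }
    for (Node node : inactive) {
      node.running = true;
    }
    if (markActive) {
      active.addAll(inactive);
      inactive.clear();
    }
  }

  /** Only nodes in the active list are considered. */
  Optional<Node> findLeader() {
    return active.stream().filter(n -> n.running && n.leader).findFirst();
  }

  /** Polls the condition until it holds or the timeout expires. */
  static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
    long start = System.nanoTime();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - start >= timeoutMs * 1_000_000L) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

In this model, a node that was stopped, restarted without being marked active, and then became leader is never returned by `findLeader()`, so a test-setup wait such as `waitFor(() -> cluster.findLeader().isPresent(), 100, 10_000)` would time out. Marking all nodes active on restart removes that blind spot.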
   
   https://issues.apache.org/jira/browse/HDDS-9429
   
   ## How was this patch tested?
   
   `TestOzoneManagerHAMetadataOnly` passed in 300 runs:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/6602838480
   
   On `master` it failed in 17/300 runs:
   
https://github.com/adoroszlai/hadoop-ozone/actions/runs/6602302723/job/17935202601


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


