xBis7 commented on PR #4362: URL: https://github.com/apache/ozone/pull/4362#issuecomment-1479466470
@adoroszlai Although the test always passes when I run it locally on repeat, it fails more often than it succeeds when I use the workflow you shared.

I have also confirmed that there is no leader change during the repetitions: on the first repetition the leader is always Node-1, it then changes to Node-3, and for every other repetition it stays Node-3.

I've added a check that all three OMs are up and running. I've set the maximum timeout to the amount of time we wait for the Ratis failover; I don't think the timeout should be longer than that. Furthermore, I've changed the check

```diff
- getCluster().getOMLeader().isLeaderReady();
+ getCluster().getOMLeader();
```

as `getOMLeader()` also checks whether the leader is ready and returns null if it is not. Check [here](https://github.com/apache/ozone/blob/master/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/MiniOzoneHAClusterImpl.java#L194-L208).

There is an underlying issue here that causes the timeout, and it might even have to do with the MiniCluster and how it works. I don't have the time to investigate further and we have other priorities. I plan to convert this into a draft PR and maybe close it later on.
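For illustration only, the kind of wait described in this comment (poll until `getOMLeader()` stops returning null, and give up after a fixed timeout) could be sketched roughly as follows. `LeaderWait` and `waitForNonNull` are hypothetical names, not part of Ozone or of this PR, and the timeout and polling interval are left as parameters because the actual values are not stated here.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper (not an Ozone class): polls a supplier until it yields a non-null value. */
final class LeaderWait {
  private LeaderWait() { }

  /**
   * Calls the supplier repeatedly, sleeping for the given interval between attempts,
   * and returns the first non-null value. Throws TimeoutException once the timeout
   * has elapsed without a non-null result.
   */
  static <T> T waitForNonNull(Supplier<T> supplier, Duration timeout, Duration interval)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      T value = supplier.get();
      if (value != null) {
        return value;
      }
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("No non-null value within " + timeout);
      }
      Thread.sleep(interval.toMillis());
    }
  }
}
```

In a test this would presumably be called with a supplier such as `() -> getCluster().getOMLeader()`, relying on the behaviour mentioned above that `getOMLeader()` returns null while no leader is ready.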
