xBis7 commented on PR #4362:
URL: https://github.com/apache/ozone/pull/4362#issuecomment-1479466470

   @adoroszlai Although the test always passes when I run it locally on 
repeat, it fails more often than it succeeds when I use the workflow you 
shared. 
   
   I have also confirmed that there is no leader change during the 
repetitions. On the first repetition the leader is always Node-1 and then 
changes to Node-3, and on every other repetition it is always Node-3.
   
   I've added a check that all three OMs are up and running. I've set the 
maximum timeout to the amount of time we wait for the Ratis failover; I don't 
think the timeout should be longer than that. Furthermore, I've changed the 
check:
   
   ```diff
   - getCluster().getOMLeader().isLeaderReady();
   + getCluster().getOMLeader(); 
   ```
   as `getOMLeader()` also checks whether the leader is ready and returns 
`null` if it is not. See 
[here](https://github.com/apache/ozone/blob/master/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/MiniOzoneHAClusterImpl.java#L194-L208).
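
For illustration, the pattern of waiting until a leader lookup stops returning `null` can be sketched roughly as below. This is not the code from the PR: `LeaderWait` and `waitForNonNull` are hypothetical names, and the supplier in `main` is only a stand-in for a call such as `getCluster().getOMLeader()`. The timeout would be whatever the test already allows for the Ratis failover.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Ozone: polls a lookup until it yields a non-null value.
final class LeaderWait {

  /**
   * Polls the supplier until it returns a non-null value or the timeout expires.
   * Returns the value, or null if the timeout elapsed (or the thread was interrupted) first.
   */
  static <T> T waitForNonNull(Supplier<T> supplier, long timeoutMs, long intervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      T value = supplier.get();
      if (value != null) {
        return value;
      }
      if (System.nanoTime() >= deadline) {
        return null;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return null;
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in for getCluster().getOMLeader(): returns null for the first two polls.
    int[] polls = {0};
    String leader = waitForNonNull(() -> ++polls[0] >= 3 ? "Node-3" : null, 2000, 10);
    System.out.println("leader = " + leader);
  }
}
```

Because a single null check covers both "no leader yet" and "leader not ready", the caller does not need a separate `isLeaderReady()` call once the lookup itself returns `null` in that case.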
   
   
   There is an underlying issue here that causes the timeout, and it might 
even be related to the MiniCluster and how it works. I don't have time to 
investigate further, as we have other priorities. I plan to convert this into 
a draft PR and maybe close it later on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

