devmadhuu opened a new pull request, #9414:
URL: https://github.com/apache/ozone/pull/9414

   ## What changes were proposed in this pull request?
   The test fails because its timeout does not cover the full time window 
needed for a node to be detected as DEAD.
   **The problem:** The health checker is not guaranteed to run immediately 
after the 8-second dead node interval expires. It runs on a 3-second schedule, 
so detection can be delayed by up to 3 additional seconds.
   
   **Additional Contributing Factors**
   
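   - No explicit heartbeat checker interval configuration: The test doesn't set 
`ozone.scm.heartbeat.thread.interval`, so it uses the default 3-second value.
   - Scheduler variance: The health checker is scheduled with 
`scheduleWithFixedDelay` semantics (from `NodeStateManager.run()`), so each 
check starts a fixed delay after the previous one finishes, and the gap 
between checks can exceed the configured interval.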
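   
   To illustrate the fixed-delay point: with 
`ScheduledExecutorService.scheduleWithFixedDelay`, the delay is measured from 
the end of one run to the start of the next, so the time a check takes pushes 
every later check back. The sketch below models that with plain arithmetic 
instead of a real executor. The class name, the method, and the run durations 
are hypothetical and are not taken from `NodeStateManager`.

```java
import java.util.Arrays;

// Hypothetical model of fixed-delay scheduling; not Ozone code.
final class FixedDelayModel {

  /**
   * Returns the start time (ms) of each run when the next run begins
   * delayMs after the previous run has finished.
   */
  static long[] startTimes(long initialDelayMs, long delayMs,
      long[] runDurationsMs) {
    long[] starts = new long[runDurationsMs.length];
    long next = initialDelayMs;
    for (int i = 0; i < runDurationsMs.length; i++) {
      starts[i] = next;
      // Fixed delay: the clock for the next run starts when this run ends.
      next = starts[i] + runDurationsMs[i] + delayMs;
    }
    return starts;
  }

  public static void main(String[] args) {
    // Assumed run durations of 200 ms, 500 ms and 100 ms (illustrative only).
    long[] starts = startTimes(3_000, 3_000, new long[] {200, 500, 100});
    System.out.println(Arrays.toString(starts)); // [3000, 6200, 9700]
  }
}
```

   With a 3-second delay the checks in this model start at 3.0 s, 6.2 s and 
9.7 s rather than at exact 3-second multiples, so the drift accumulates.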
   
   **Please describe your PR in detail:**
   The test is flaky because the 10-second timeout is too short. With an 
8-second dead node interval and a 3-second health checker interval, the 
worst-case detection time is 8 + 3 = 11 seconds. Scheduler variance and GC 
pauses (especially in Java 21) can push it further, so timeouts are expected.
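   
   The worst-case bound can be sketched as a small calculation. The class and 
method names and the 2-second safety margin are illustrative assumptions, not 
values from the patch. The model also ignores the time the check itself takes.

```java
import java.time.Duration;

// Hypothetical helper for reasoning about the test timeout; not Ozone code.
final class DeadNodeDetectionBound {

  /**
   * Worst case: the dead node interval expires just after a check ran,
   * so detection waits one more full checker interval.
   */
  static Duration worstCaseDetection(Duration deadNodeInterval,
      Duration checkerInterval) {
    return deadNodeInterval.plus(checkerInterval);
  }

  /** True if the timeout covers the worst case plus a safety margin. */
  static boolean timeoutIsSufficient(Duration timeout,
      Duration deadNodeInterval, Duration checkerInterval, Duration margin) {
    Duration needed =
        worstCaseDetection(deadNodeInterval, checkerInterval).plus(margin);
    return timeout.compareTo(needed) >= 0;
  }

  public static void main(String[] args) {
    Duration dead = Duration.ofSeconds(8);     // dead node interval in the test
    Duration checker = Duration.ofSeconds(3);  // default checker interval
    Duration margin = Duration.ofSeconds(2);   // assumed margin for GC/scheduling

    System.out.println(worstCaseDetection(dead, checker).getSeconds()); // 11
    System.out.println(
        timeoutIsSufficient(Duration.ofSeconds(10), dead, checker, margin)); // false
  }
}
```

   Under these assumptions a 10-second timeout is below even the 11-second 
bound with no margin, which matches the intermittent failures.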
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-11645
   
   ## How was this patch tested?
   The test was run through the flaky-test workflow 2 times, with each run 
executing the test 100 times. Passing workflow results: 
   https://github.com/devmadhuu/ozone/actions/runs/19850610337
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

