devmadhuu opened a new pull request, #9414: URL: https://github.com/apache/ozone/pull/9414
## What changes were proposed in this pull request?

The test fails intermittently because its timeout does not cover the full timing window required for a node to be detected as DEAD.

**The problem:** The health checker is not guaranteed to run immediately after the 8-second dead node interval expires. It runs on a 3-second schedule, so detection can be delayed by up to 3 additional seconds.

**Additional contributing factors**

- No explicit heartbeat checker interval configuration: the test doesn't set `ozone.scm.heartbeat.thread.interval`, so it uses the default 3-second value.
- Scheduler variance: the health checker uses `scheduleWithFixedDelay` semantics (from `NodeStateManager.run()`), so each check starts a fixed delay after the previous one finishes rather than at a fixed rate.

**Please describe your PR in detail:**

The test is flaky because the 10-second timeout is mathematically insufficient. With an 8-second dead node interval and a 3-second health checker interval, the worst-case detection time is 8 + 3 = 11 seconds, which already exceeds the timeout. Add scheduler variance and GC pauses (especially in Java 21), and intermittent timeouts are to be expected.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11645

## How was this patch tested?

The flaky test was verified with the flaky-test workflow, which was run 2 times, each run executing the test 100 times. Passing workflow run:

https://github.com/devmadhuu/ozone/actions/runs/19850610337
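For reference, the worst-case arithmetic from the description can be written out as a small sketch. The class and method names below are hypothetical and not part of Ozone; the only inputs are the values quoted above (8-second dead node interval, 3-second health checker interval, 10-second test timeout), and the model deliberately ignores scheduler variance and GC pauses.

```java
// Hypothetical helper for illustration only; none of these names exist in Ozone.
final class DeadNodeDetectionWindow {

  private DeadNodeDetectionWindow() { }

  // Worst case: the node crosses the dead threshold just after a health check
  // has run, so it is only noticed one full check interval later.
  static long worstCaseDetectionSeconds(long deadNodeIntervalSeconds,
                                        long checkIntervalSeconds) {
    return deadNodeIntervalSeconds + checkIntervalSeconds;
  }

  // A wait timeout is only safe if it is strictly larger than the worst case,
  // which leaves at least some room for scheduler variance and GC pauses.
  static boolean timeoutIsSufficient(long timeoutSeconds,
                                     long deadNodeIntervalSeconds,
                                     long checkIntervalSeconds) {
    return timeoutSeconds
        > worstCaseDetectionSeconds(deadNodeIntervalSeconds, checkIntervalSeconds);
  }

  public static void main(String[] args) {
    long deadNodeInterval = 8; // seconds, from the PR description
    long checkInterval = 3;    // seconds, default health checker interval per the PR
    long timeout = 10;         // seconds, the test's original timeout per the PR

    System.out.println("worst case = "
        + worstCaseDetectionSeconds(deadNodeInterval, checkInterval) + "s");
    System.out.println(timeout + "s timeout sufficient? "
        + timeoutIsSufficient(timeout, deadNodeInterval, checkInterval));
  }
}
```

With the values from the description, the worst case evaluates to 11 seconds and the 10-second timeout is reported as insufficient. Any replacement timeout would need to exceed 11 seconds by some margin under this simplified model.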
