chihsuan opened a new pull request, #10556:
URL: https://github.com/apache/ozone/pull/10556

   ## What changes were proposed in this pull request?
   
   `TestDeadNodeHandler.testOnMessage` fails intermittently with either
   `NullPointerException: Parent == null` (from `HealthyReadOnlyNodeHandler`) or
   `AssertionFailedError: expected: <false> but was: <true>` (when asserting the
   dead node was removed from the cluster network topology).
   
   Root cause: the test uses a real SCM whose `NodeStateManager` runs a periodic
   health check (`ozone.scm.heartbeat.thread.interval`, default 3s). The test
   drives node health transitions manually via `setNodeHealthState`, which forces
   a node to `DEAD` but does not age its last heartbeat. When the background
   health check runs, it sees a `DEAD` node with a fresh heartbeat and resurrects
   it (`DEAD -> HEALTHY_READONLY`), concurrently mutating the `NetworkTopology`.
   This races with the handlers under test:
   
   - `DeadNodeHandler` re-reads the node status before removing it from the
     topology and skips removal when the node is no longer `DEAD`, so the node
     stays in the topology and the `assertFalse(... contains ...)` fails.
   - The concurrent topology add/remove trips the parent sanity check in
     `HealthyReadOnlyNodeHandler.onMessage`, producing the NPE.
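
The first bullet can be reproduced in a toy model. The sketch below is a minimal illustration, not Ozone code: `Node`, `healthCheck`, `onDeadNode`, and the `DEAD_AFTER_MS` threshold are hypothetical stand-ins for the node state, the periodic check, and the re-read guard in `DeadNodeHandler`. It calls the check by hand so the interleaving is deterministic, and it does not attempt to model the NPE from the second bullet.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race; none of these types are Ozone classes. */
class DeadNodeRaceSketch {
  enum Health { HEALTHY, HEALTHY_READONLY, DEAD }

  // Hypothetical threshold: a heartbeat younger than this counts as fresh.
  static final long DEAD_AFTER_MS = 10_000;

  static final class Node {
    Health health = Health.HEALTHY;
    final long lastHeartbeatMs;
    Node(long lastHeartbeatMs) { this.lastHeartbeatMs = lastHeartbeatMs; }
  }

  /** Stand-in for the periodic check: a DEAD node with a fresh heartbeat is resurrected. */
  static void healthCheck(Node node, long nowMs) {
    boolean fresh = nowMs - node.lastHeartbeatMs < DEAD_AFTER_MS;
    if (node.health == Health.DEAD && fresh) {
      node.health = Health.HEALTHY_READONLY;
    }
  }

  /** Stand-in for the handler guard: re-read the status, remove only if still DEAD. */
  static void onDeadNode(Node node, Set<Node> topology) {
    if (node.health == Health.DEAD) {
      topology.remove(node);
    }
  }

  /** Returns true if the node is still in the topology after the handler ran. */
  static boolean stillInTopology(boolean healthCheckFiresFirst) {
    long now = 1_000;
    Node node = new Node(now);            // heartbeat is fresh
    Set<Node> topology = new HashSet<>();
    topology.add(node);
    node.health = Health.DEAD;            // forced state; heartbeat is not aged
    if (healthCheckFiresFirst) {
      healthCheck(node, now + 100);       // background check wins the race
    }
    onDeadNode(node, topology);
    return topology.contains(node);
  }

  public static void main(String[] args) {
    System.out.println("check fires first: " + stillInTopology(true));
    System.out.println("check suppressed:  " + stillInTopology(false));
  }
}
```

When the check fires first, the node is no longer `DEAD` by the time the handler re-reads it, so it stays in the topology, which is the condition the failing `assertFalse` observes.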
   
   The guards in the production handlers (introduced by HDDS-14834) are correct;
   the problem is that the test does not isolate itself from the periodic health
   check. The fix sets the heartbeat process interval high in `setup()` so the
   background check does not fire during the test, matching the existing pattern
   in this package of controlling the health check via configuration
   (`TestSCMNodeManager`). The `@Flaky` tag is removed now that the root cause is
   addressed.
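
Why a large interval is enough can be illustrated with a plain JDK scheduler. This is a sketch under assumptions, not the `NodeStateManager` implementation: the helper `runsWithin` is hypothetical, and it assumes the periodic check first runs one full interval after start. In the real test the interval is controlled through `ozone.scm.heartbeat.thread.interval`; the values below are arbitrary.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a periodic check; not the NodeStateManager scheduler. */
class HealthCheckIntervalSketch {

  /** Counts how often a task scheduled every intervalMs fires within windowMs. */
  static int runsWithin(long intervalMs, long windowMs) {
    AtomicInteger runs = new AtomicInteger();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      // Assumed shape: first run after one interval, then repeat with the same delay.
      scheduler.scheduleWithFixedDelay(
          runs::incrementAndGet, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
      Thread.sleep(windowMs);             // stands in for the duration of the test body
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      scheduler.shutdownNow();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("short interval runs: " + runsWithin(20, 300));
    System.out.println("long interval runs:  " + runsWithin(TimeUnit.HOURS.toMillis(1), 200));
  }
}
```

With an interval far longer than the test, the background task never runs inside the test window, so the manually forced node states are the only transitions the handlers see.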
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-14977
   
   ## How was this patch tested?
   
   Ran `TestDeadNodeHandler#testOnMessage` 8 times in a row locally; all passed,
   with the per-run time dropping from the previous 14-17s (under the race) to a
   steady ~8s. The full `TestDeadNodeHandler` class also passes. Verified with
   `checkstyle.sh`.
   



