Ivan Andika created HDDS-14834:
----------------------------------

             Summary: SCM NetworkTopology race condition
                 Key: HDDS-14834
                 URL: https://issues.apache.org/jira/browse/HDDS-14834
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Ivan Andika


We found a race condition on the cluster map between DeadNodeHandler and 
HealthyReadOnlyNodeHandler:
 * DeadNodeHandler: Removes the node from the topology
 ** Triggered by NodeStateManager#checkNodesHealth, the health check that 
NodeStateManager#run executes periodically (see scheduleNextHealthCheck)
 * HealthyReadOnlyNodeHandler: Adds the node to the topology
 ** Triggered by a heartbeat from a DN that has come back to life

If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the 
following interleaving is possible:
 # DeadNodeHandler is invoked but has not yet removed the node from the 
network topology, since it is still working on other things such as closing 
containers, destroying pipelines, etc.
 # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and 
adds the node to the network topology
 # DeadNodeHandler removes the node from the network topology

The outcome is that the node does not exist in the topology although it is 
healthy. This can cause issues with the placement policy.
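
The interleaving described above can be replayed deterministically in a toy 
model. This is only a sketch under stated assumptions: the class 
TopologyRaceSketch, its methods, and the plain set standing in for the cluster 
map are all hypothetical and are not the real Ozone NetworkTopology or handler 
APIs. The three steps are executed in the problematic order on a single thread 
so the result does not depend on timing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; all names here are hypothetical, not Ozone code. */
final class TopologyRaceSketch {
    // Hypothetical stand-in for the SCM cluster map: a set of node IDs.
    static final Set<String> clusterMap = ConcurrentHashMap.newKeySet();

    // Step 1: the dead-node path starts with slow cleanup work
    // (closing containers, destroying pipelines, ...), elided here.
    static void deadHandlerCleanup(String node) {
        // intentionally empty in this sketch
    }

    // Step 2: the healthy-read-only path adds the node to the topology.
    static void healthyReadOnlyHandler(String node) {
        clusterMap.add(node);
    }

    // Step 3: the dead-node path finally removes the node from the topology.
    static void deadHandlerRemove(String node) {
        clusterMap.remove(node);
    }

    /** Replays steps 1-3 in order; returns whether the node is still present. */
    static boolean replayInterleaving(String node) {
        clusterMap.clear();
        clusterMap.add(node);          // node was registered before it went stale
        deadHandlerCleanup(node);      // step 1
        healthyReadOnlyHandler(node);  // step 2 (node is alive again)
        deadHandlerRemove(node);       // step 3 (stale removal wins)
        return clusterMap.contains(node);
    }

    public static void main(String[] args) {
        // Prints "node in topology after replay: false"
        System.out.println("node in topology after replay: "
            + replayInterleaving("dn-1"));
    }
}
```

In this model each individual set operation is thread-safe, yet the node still 
ends up missing: the problem is the ordering of the two handlers' steps, not 
the safety of a single add or remove.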


