[
https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-14834:
-------------------------------
Description:
We found a race condition on the cluster map between DeadNodeHandler and
HealthyReadOnlyNodeHandler during rolling DN restarts.
* DeadNodeHandler: Removes the node from the topology
** Triggered by NodeStateManager#checkNodesHealth in NodeStateManager#run, a
health check that runs periodically (see scheduleNextHealthCheck)
* HealthyReadOnlyNodeHandler: Adds the node to the topology
** Triggered by a heartbeat from a DN that has come back up
If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
following interleaving is possible:
# DeadNodeHandler is invoked but has not yet removed the node from the network
topology, because it is still busy with other work such as closing containers
and destroying pipelines
# HealthyReadOnlyNodeHandler runs because the DN is detected as alive, and adds
the node to the network topology
# DeadNodeHandler removes the node from the network topology
The outcome is that the node is missing from the topology even though it is
healthy. This can cause issues with the placement policy, since the topology
information of the DN does not exist.
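The interleaving above can be replayed deterministically in a small toy
model. The sketch below is not Ozone code: the class TopologyRaceSketch, its
methods, and the node name "dn-1" are all hypothetical, and the topology is
reduced to a set of node names. It splits the dead-node handling into a slow
first phase and a final removal so that the three steps can be run in the
problematic order. The "guarded" variant, which re-checks the node state
before removing, is only one possible mitigation assumed here for
illustration; it is not necessarily the fix adopted for this issue.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the reported interleaving. All names are hypothetical. */
class TopologyRaceSketch {
  enum NodeState { HEALTHY_READONLY, DEAD }

  // Stand-in for the SCM cluster map: just the set of known node names.
  private final Set<String> topology = new HashSet<>();
  private NodeState state = NodeState.HEALTHY_READONLY;

  TopologyRaceSketch(String dn) {
    topology.add(dn);
  }

  // Step 1: the health check marks the node dead and the handler starts its
  // slow work (closing containers, destroying pipelines) before removal.
  synchronized void deadHandlerBegin() {
    state = NodeState.DEAD;
  }

  // Step 2: a heartbeat from the restarted DN re-adds it to the topology.
  synchronized void healthyReadOnlyHandler(String dn) {
    state = NodeState.HEALTHY_READONLY;
    topology.add(dn);
  }

  // Step 3 (as described in the report): removal without re-checking state.
  synchronized void deadHandlerRemoveUnguarded(String dn) {
    topology.remove(dn);
  }

  // Step 3 (assumed mitigation): remove only if the node is still dead.
  // The check and the removal share one lock with healthyReadOnlyHandler.
  synchronized void deadHandlerRemoveGuarded(String dn) {
    if (state == NodeState.DEAD) {
      topology.remove(dn);
    }
  }

  synchronized boolean inTopology(String dn) {
    return topology.contains(dn);
  }

  /** Replays steps 1 to 3 in order; returns whether the node survives. */
  public static boolean replay(boolean guarded) {
    String dn = "dn-1";
    TopologyRaceSketch scm = new TopologyRaceSketch(dn);
    scm.deadHandlerBegin();
    scm.healthyReadOnlyHandler(dn);
    if (guarded) {
      scm.deadHandlerRemoveGuarded(dn);
    } else {
      scm.deadHandlerRemoveUnguarded(dn);
    }
    return scm.inTopology(dn);
  }

  public static void main(String[] args) {
    System.out.println("unguarded: in topology = " + replay(false));
    System.out.println("guarded:   in topology = " + replay(true));
  }
}
```

In the unguarded replay the node is healthy but absent from the topology,
which matches the outcome described above. In the guarded replay the stale
removal is skipped because the node is no longer dead.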
was:
We found a race condition on the cluster map between DeadNodeHandler and
HealthyReadOnlyNodeHandler during rolling DN restarts.
* DeadNodeHandler: Removes the node from the topology
** Triggered by NodeStateManager#checkNodesHealth in NodeStateManager#run, a
health check that runs periodically (see scheduleNextHealthCheck)
* HealthyReadOnlyNodeHandler: Adds the node to the topology
** Triggered by a heartbeat from a DN that has come back up
If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
following interleaving is possible:
# DeadNodeHandler is invoked but has not yet removed the node from the network
topology, because it is still busy with other work such as closing containers
and destroying pipelines
# HealthyReadOnlyNodeHandler runs because the DN is detected as alive, and adds
the node to the network topology
# DeadNodeHandler removes the node from the network topology
The outcome is that the node is missing from the topology even though it is
healthy. This can cause issues with the placement policy.
> SCM NetworkTopology race condition
> ----------------------------------
>
> Key: HDDS-14834
> URL: https://issues.apache.org/jira/browse/HDDS-14834
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ivan Andika
> Priority: Major