Wei-Chiu Chuang created HDDS-14994:
--------------------------------------
Summary: [Docs] Explain the criteria when a datanode is in
maintenance mode
Key: HDDS-14994
URL: https://issues.apache.org/jira/browse/HDDS-14994
Project: Apache Ozone
Issue Type: Task
Reporter: Wei-Chiu Chuang
[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]
We talked about the different datanode states: *IN_SERVICE*, *ENTERING_MAINTENANCE*,
and *IN_MAINTENANCE*.
But we did not explain when a datanode transitions from *ENTERING_MAINTENANCE*
to *IN_MAINTENANCE*.
The following are the full internal details. They need to be translated into a
simple, user-facing description.
---
The Transitionary Phase (ENTERING_MAINTENANCE)
Once a node is in the trackedNodes set, the processTransitioningNodes()
method checks its status on every "tick":
- How it checks: It calls nodeManager.getNodeStatus(dn), which returns a
NodeStatus object containing the NodeOperationalState.
- Logic: If status.isEnteringMaintenance() is true, the monitor performs
several checks before allowing it to fully enter maintenance:
1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC
pipelines on the DN are closed.
2. DN Persistence: It checks status.getOperationalState() ==
dn.getPersistedOpState(). This ensures the Datanode has actually received the
command and persisted the "Entering Maintenance" state.
3. Container Replication: checkContainersReplicatedOnNode checks if all
containers on the node are sufficiently replicated elsewhere so the node can
safely go offline.
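The three checks above can be read as a single gate that must pass in full before the node leaves ENTERING_MAINTENANCE. The Python sketch below is a simplified model for documentation purposes, not Ozone's Java code: the `NodeSnapshot` fields and the `can_enter_maintenance` helper are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical, simplified view of one datanode on a monitor tick."""
    open_pipelines: int        # pipelines on the node that are not yet closed
    scm_op_state: str          # operational state recorded by SCM
    persisted_op_state: str    # state the datanode reports it has persisted
    unsafe_containers: int     # containers not yet sufficiently replicated


def can_enter_maintenance(node: NodeSnapshot) -> bool:
    """Return True only when all three checks described above pass."""
    if node.scm_op_state != "ENTERING_MAINTENANCE":
        return False
    pipelines_closed = node.open_pipelines == 0
    state_persisted = node.scm_op_state == node.persisted_op_state
    containers_safe = node.unsafe_containers == 0
    return pipelines_closed and state_persisted and containers_safe
```

In this model, a node with one pipeline still open, or one that has not yet persisted the new state, stays in ENTERING_MAINTENANCE and is simply re-evaluated on the next tick.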
"sufficiently replicated elsewhere"
This means the monitor has verified that every piece of data (container)
stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
Specifically, it checks:
* Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This
ensures the container isn't in a "closed" or "quasi-closed" state that would be
endangered if this node disappeared.
* Replication Count: It asks the ReplicationManager if the container is
UNDER_REPLICATED.
* If a container is supposed to have 3 replicas and it has 3, it is
"safe" (the node can go into maintenance because 2 replicas will remain).
* If it only has 2 replicas, it is "Under Replicated." The monitor will
block the maintenance transition until the cluster creates a new copy on a
different node.
* Wait until zero: The node cannot enter maintenance until the number of
underReplicated and unclosed containers on that specific node reaches zero.
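As a rough illustration of the "wait until zero" rule, the following Python sketch counts the containers that would still block the transition. It is a simplified model under stated assumptions, not the ReplicationManager logic: `ContainerView`, its fields, and both helper functions are hypothetical, and the replica count is assumed to include the replica on the node entering maintenance, matching the 3-replica example above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ContainerView:
    """Hypothetical summary of one container with a replica on the node."""
    expected_replicas: int            # configured replica count, e.g. 3
    current_replicas: int             # existing replicas, including this node's
    healthy_enough_for_offline: bool  # stand-in for the offline health check


def is_under_replicated(container: ContainerView) -> bool:
    """A container with fewer replicas than expected is under-replicated."""
    return container.current_replicas < container.expected_replicas


def blocking_containers(containers: Iterable[ContainerView]) -> int:
    """Count containers that still prevent the node from entering maintenance."""
    return sum(
        1
        for c in containers
        if not c.healthy_enough_for_offline or is_under_replicated(c)
    )
```

For example, given one container with 3 of 3 replicas and one with 2 of 3, `blocking_containers` returns 1, so in this model the node would remain in ENTERING_MAINTENANCE until the second container gains a new copy on another node.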