Wei-Chiu Chuang created HDDS-14994:
--------------------------------------

             Summary: [Docs] Explain the criteria when a datanode is in 
maintenance mode
                 Key: HDDS-14994
                 URL: https://issues.apache.org/jira/browse/HDDS-14994
             Project: Apache Ozone
          Issue Type: Task
            Reporter: Wei-Chiu Chuang


[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]

 

We talked about the different datanode states: *IN_SERVICE*, *ENTERING_MAINTENANCE*, 
and *IN_MAINTENANCE*.

But we did not explain when a datanode transitions from *ENTERING_MAINTENANCE* 
to *IN_MAINTENANCE*.

 

The full internal details follow. They need to be translated into a simple, 
user-facing description.

---

The Transitionary Phase (ENTERING_MAINTENANCE)
  Once a node is in the trackedNodes set, the processTransitioningNodes() 
method checks its status on every "tick":
   - How it checks: It calls nodeManager.getNodeStatus(dn), which returns a 
NodeStatus object containing the NodeOperationalState.
   - Logic: If status.isEnteringMaintenance() is true, the monitor performs 
several checks before allowing it to fully enter maintenance:
       1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC 
pipelines on the DN are closed.
       2. DN Persistence: It checks status.getOperationalState() == 
dn.getPersistedOpState(). This ensures the Datanode has actually received the 
command
          and persisted the "Entering Maintenance" state.
       3. Container Replication: checkContainersReplicatedOnNode checks if all 
containers on the node are sufficiently replicated elsewhere so the node can
          safely go offline.
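
The three checks above can be sketched as a single gate. This is only a simplified model for illustration, not the actual monitor code: the class MaintenanceGateSketch, the method nextState, and the plain integer counters are all hypothetical stand-ins for what checkPipelinesClosedOnNode, the persisted-state comparison, and checkContainersReplicatedOnNode report.

```java
/** Hypothetical, simplified model of the ENTERING_MAINTENANCE gate (not Ozone code). */
public class MaintenanceGateSketch {

    enum OpState { IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE }

    /**
     * Returns the state the node should have after one monitor "tick".
     * All parameters are illustrative inputs, assumed to be gathered elsewhere.
     */
    static OpState nextState(OpState scmState, OpState persistedOnDatanode,
                             int openPipelines, int underReplicated, int unclosed) {
        if (scmState != OpState.ENTERING_MAINTENANCE) {
            return scmState; // only transitioning nodes are evaluated here
        }
        // 1. Pipeline closure: no pipeline may still be open on the node.
        boolean pipelinesClosed = openPipelines == 0;
        // 2. DN persistence: the datanode must have persisted the same state.
        boolean statePersisted = scmState == persistedOnDatanode;
        // 3. Container replication: nothing under-replicated or unclosed.
        boolean containersSafe = underReplicated == 0 && unclosed == 0;

        return (pipelinesClosed && statePersisted && containersSafe)
            ? OpState.IN_MAINTENANCE
            : OpState.ENTERING_MAINTENANCE;
    }

    public static void main(String[] args) {
        // All checks pass: prints IN_MAINTENANCE
        System.out.println(nextState(OpState.ENTERING_MAINTENANCE,
            OpState.ENTERING_MAINTENANCE, 0, 0, 0));
        // Datanode has not persisted the state yet: prints ENTERING_MAINTENANCE
        System.out.println(nextState(OpState.ENTERING_MAINTENANCE,
            OpState.IN_SERVICE, 0, 0, 0));
    }
}
```

In this model a node stays in ENTERING_MAINTENANCE for as long as any one of the three conditions fails, which is the behaviour a user-facing description would need to convey.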

"sufficiently replicated elsewhere"

  This means the monitor has verified that every piece of data (container) 
stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
  Specifically, it checks:

   * Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This 
ensures the container is no longer open: a container that has not yet reached a
     "closed" or "quasi-closed" state would be endangered if this node disappeared.
   * Replication Count: It asks the ReplicationManager if the container is 
UNDER_REPLICATED.
       * If a container is supposed to have 3 replicas and it has 3, it is 
"safe" (the node can go into maintenance because 2 replicas will remain).
       * If it only has 2 replicas, it is "Under Replicated." The monitor will 
block the maintenance transition until the cluster creates a new copy on a
         different node.
   * Wait until zero: The node cannot enter maintenance until the number of 
underReplicated and unclosed containers on that specific node reaches zero.
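
The "wait until zero" rule can be illustrated with a small counting sketch. It assumes a simplified container model: the class ReplicationCheckSketch, the ContainerInfo fields, and the rule "under-replicated means fewer replicas than expected" are hypothetical and follow the description above rather than the real ReplicationManager logic.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of counting the containers that block maintenance (not Ozone code). */
public class ReplicationCheckSketch {

    /** Illustrative per-container view, assumed to be computed elsewhere. */
    static final class ContainerInfo {
        final boolean healthyEnoughForOffline; // stand-in for isHealthyEnoughForOffline()
        final int expectedReplicas;
        final int actualReplicas;

        ContainerInfo(boolean healthyEnoughForOffline, int expectedReplicas, int actualReplicas) {
            this.healthyEnoughForOffline = healthyEnoughForOffline;
            this.expectedReplicas = expectedReplicas;
            this.actualReplicas = actualReplicas;
        }

        boolean underReplicated() {
            return actualReplicas < expectedReplicas;
        }
    }

    /** Returns a two-element array: {underReplicated count, unclosed count}. */
    static int[] countBlockers(List<ContainerInfo> containers) {
        int underReplicated = 0;
        int unclosed = 0;
        for (ContainerInfo c : containers) {
            if (!c.healthyEnoughForOffline) {
                unclosed++;          // fails the offline health check
            } else if (c.underReplicated()) {
                underReplicated++;   // needs another copy elsewhere first
            }
        }
        return new int[] {underReplicated, unclosed};
    }

    /** The node may proceed only when both counters are zero. */
    static boolean sufficientlyReplicated(List<ContainerInfo> containers) {
        int[] blockers = countBlockers(containers);
        return blockers[0] == 0 && blockers[1] == 0;
    }

    public static void main(String[] args) {
        List<ContainerInfo> containers = Arrays.asList(
            new ContainerInfo(true, 3, 3),   // safe
            new ContainerInfo(true, 3, 2));  // under-replicated, blocks the transition
        System.out.println(sufficientlyReplicated(containers)); // prints false
    }
}
```

Once the cluster creates the missing copy, the second container in this example would report 3 of 3 replicas and the check would pass.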



