[jira] [Updated] (HDDS-14994) [Docs] Explain the criteria when a datanode is in maintainance mode

Wei-Chiu Chuang (Jira) Wed, 08 Apr 2026 15:25:11 -0700


     [ https://issues.apache.org/jira/browse/HDDS-14994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Wei-Chiu Chuang updated HDDS-14994:
-----------------------------------
    Description: 
[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]

 

We talked about the different datanode states: *IN_SERVICE*, 
*ENTERING_MAINTENANCE*, and *IN_MAINTENANCE*.

But we did not explain when a datanode transitions from *ENTERING_MAINTENANCE* 
to *IN_MAINTENANCE*.

 

The following are the full internal details. They need to be translated into a 
simple, user-facing description.

---

The Transitional Phase (ENTERING_MAINTENANCE)
  Once a node is in the trackedNodes set, the processTransitioningNodes() 
method checks its status on every "tick":
   - How it checks: It calls nodeManager.getNodeStatus(dn), which returns a 
NodeStatus object containing the NodeOperationalState.
   - Logic: If status.isEnteringMaintenance() is true, the monitor performs 
three checks before allowing the node to fully enter maintenance:
       1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC 
pipelines on the DN are closed.
       2. DN Persistence: It checks status.getOperationalState() == 
dn.getPersistedOpState(). This ensures the Datanode has actually received the 
command and persisted the "Entering Maintenance" state.
       3. Container Replication: checkContainersReplicatedOnNode checks whether 
all containers on the node are sufficiently replicated elsewhere so the node 
can safely go offline.
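
The three checks above can be pictured as a single gate that is evaluated on 
every tick. The following is only a minimal sketch under stated assumptions: 
MaintenanceGateSketch, NodeView, and their fields are hypothetical stand-ins 
invented for illustration and are not the real Ozone classes or method 
signatures. It only shows that all three conditions must hold together before 
the state changes.

```java
// Hypothetical model of the three gates described above. None of these types
// are real Ozone classes; they only mirror the ideas named in the text.
final class MaintenanceGateSketch {

  enum OpState { IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE }

  /** Simplified snapshot of what the monitor knows about one datanode. */
  static final class NodeView {
    final OpState scmState;        // operational state recorded by SCM
    final OpState persistedState;  // state the datanode reports as persisted
    final int openPipelines;       // pipelines on this node not yet closed
    final int blockingContainers;  // under-replicated plus unclosed containers

    NodeView(OpState scmState, OpState persistedState,
             int openPipelines, int blockingContainers) {
      this.scmState = scmState;
      this.persistedState = persistedState;
      this.openPipelines = openPipelines;
      this.blockingContainers = blockingContainers;
    }
  }

  /** True only when all three checks from the description pass. */
  static boolean canEnterMaintenance(NodeView n) {
    if (n.scmState != OpState.ENTERING_MAINTENANCE) {
      return false; // the gate only applies to nodes that are transitioning
    }
    boolean pipelinesClosed = n.openPipelines == 0;          // check 1
    boolean statePersisted = n.scmState == n.persistedState; // check 2
    boolean containersSafe = n.blockingContainers == 0;      // check 3
    return pipelinesClosed && statePersisted && containersSafe;
  }

  /** One monitor "tick": returns the state the node should have afterwards. */
  static OpState tick(NodeView n) {
    return canEnterMaintenance(n) ? OpState.IN_MAINTENANCE : n.scmState;
  }
}
```

In this sketch a node that fails any one check simply keeps its current state 
and is evaluated again on the next tick, which matches the "wait until ready" 
behaviour described above.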

"sufficiently replicated elsewhere"

  This means the monitor has verified that every container stored on that 
Datanode has enough "healthy" copies elsewhere in the cluster.
  Specifically, it checks:

   * Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This 
ensures the container is in a CLOSED or QUASI_CLOSED state; a container in any 
other state counts as "unclosed" and could be endangered if this node 
disappeared.
   * Replication Count: It asks the ReplicationManager whether the container is 
UNDER_REPLICATED.
       * If a container is supposed to have 3 replicas and has 3, it is 
"safe" (the node can go into maintenance because 2 replicas will remain).
       * If it has only 2 replicas, it is under-replicated. The monitor 
blocks the maintenance transition until the cluster creates a new copy on a 
different node.
   * Wait until zero: The node cannot enter maintenance until the number of 
underReplicated and unclosed containers on that node reaches zero.
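
The "wait until zero" rule can be sketched as a simple count over the 
containers on the node. This is a simplified illustration, not the actual 
ReplicationManager logic: ContainerReplicationSketch, ContainerView, and the 
replica-count fields are hypothetical names, and the real health checks take 
more factors into account than a plain comparison of replica counts.

```java
import java.util.List;

// Hypothetical model: counts the containers that would block maintenance.
final class ContainerReplicationSketch {

  enum LifeCycle { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  /** Simplified view of one container that has a replica on the node. */
  static final class ContainerView {
    final LifeCycle state;
    final int requiredReplicas; // e.g. 3 for a three-way replicated container
    final int healthyReplicas;  // healthy copies currently in the cluster

    ContainerView(LifeCycle state, int requiredReplicas, int healthyReplicas) {
      this.state = state;
      this.requiredReplicas = requiredReplicas;
      this.healthyReplicas = healthyReplicas;
    }

    /** "Unclosed" here means neither CLOSED nor QUASI_CLOSED. */
    boolean isUnclosed() {
      return state != LifeCycle.CLOSED && state != LifeCycle.QUASI_CLOSED;
    }

    /** Assumption: fewer healthy copies than required means under-replicated. */
    boolean isUnderReplicated() {
      return healthyReplicas < requiredReplicas;
    }
  }

  /** Returns {underReplicated, unclosed} for the given containers. */
  static int[] countBlocking(List<ContainerView> containers) {
    int underReplicated = 0;
    int unclosed = 0;
    for (ContainerView c : containers) {
      if (c.isUnclosed()) {
        unclosed++;         // not healthy enough to go offline
      } else if (c.isUnderReplicated()) {
        underReplicated++;  // needs another copy elsewhere first
      }
    }
    return new int[] {underReplicated, unclosed};
  }

  /** The node may proceed only when both counters are zero. */
  static boolean sufficientlyReplicated(List<ContainerView> containers) {
    int[] counts = countBlocking(containers);
    return counts[0] == 0 && counts[1] == 0;
  }
}
```

With this model, a node holding one CLOSED container with 3 of 3 replicas 
passes, while the same node with a 2 of 3 container stays blocked until the 
missing copy is created elsewhere.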

 

---

The transition to maintenance mode should go through once underReplicated == 0 
&& unclosed == 0.

An EC container counts as unclosed if it is unhealthy, that is, neither CLOSED 
nor QUASI_CLOSED.



> [Docs] Explain the criteria when a datanode is in maintainance mode
> -------------------------------------------------------------------
>
>                 Key: HDDS-14994
>                 URL: https://issues.apache.org/jira/browse/HDDS-14994
>             Project: Apache Ozone
>          Issue Type: Task
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
