Re: [PR] HDDS-14994. [Docs] Explain the criteria when a datanode is in maintenance mode. [ozone-site]

via GitHub Wed, 15 Apr 2026 04:37:54 -0700


sreejasahithi commented on code in PR #380:
URL: https://github.com/apache/ozone-site/pull/380#discussion_r3086122990



##########
docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/02-datanode-maintenance.md:
##########
@@ -11,9 +11,20 @@ While in maintenance mode, a Datanode does not accept new 
writes but may still s
 The Datanode transitions through the following operational states during 
maintenance:
 
 1. **IN_SERVICE**: The Datanode is fully operational and participating in data 
writes and reads.
-2. **ENTERING_MAINTENANCE**: The Datanode is transitioning into maintenance 
mode. New writes will be avoided.
+2. **ENTERING_MAINTENANCE**: The Datanode is transitioning into maintenance 
mode. New writes will be avoided. The SCM monitors the Datanode until it meets 
all safety criteria before allowing it to fully enter maintenance.
 3. **IN_MAINTENANCE**: The Datanode is in maintenance mode. Data will not be 
written to it. If the Datanode remains in this state beyond the configured 
maintenance window, its data will start to be replicated to other Datanodes to 
ensure data durability.
 
+### Transition Criteria (ENTERING_MAINTENANCE to IN_MAINTENANCE)
+
+A Datanode will remain in the `ENTERING_MAINTENANCE` state until the SCM 
(Storage Container Manager) verifies the following safety conditions:
+
+* **Pipeline Closure**: All open Ratis and EC pipelines on the Datanode must 
be successfully closed. This ensures no active write operations are interrupted.
+* **Datanode Acknowledgment**: The Datanode must confirm it has received the 
maintenance command and persisted the "ENTERING_MAINTENANCE" state to its local 
disk. This prevents state confusion if the Datanode is rebooted.
+* **Sufficient Replication (Data Safety)**: The SCM verifies that every 
container stored on the Datanode has enough healthy copies elsewhere in the 
cluster to remain safe while the node is offline.
+    * **Ratis (3-way)**: By default, at least 2 replicas must remain online on 
other healthy Datanodes (configurable via 
`hdds.scm.replication.maintenance.replica.minimum`).
+    * **Erasure Coding (EC)**: By default, the cluster must maintain at least 
`Data Shards + 1` available shards elsewhere (configurable via 
`hdds.scm.replication.maintenance.remaining.redundancy`). For example, in an 
RS(6,3) policy, at least 7 shards must be online.
+    * **Health Check**: Every container on the node must be in a stable state 
such as `CLOSED` or `QUASI_CLOSED`. If a container is under-replicated or not 
closed, the SCM will block the transition and trigger background replication to 
create new copies on other nodes until the safety threshold is met.
+

Review Comment:
   Here, "If a container is under-replicated or not closed" appears to 
contradict the previous sentence in this bullet, which says that both `CLOSED` 
and `QUASI_CLOSED` are accepted.
   We could instead change it to:
   "If a container is under-replicated or in OPEN state, the SCM will block the 
transition..."
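
   To make the intended reading concrete, here is a rough sketch of the check as 
I understand it from the text in this diff. It is not SCM code: every name in it 
(`ContainerView`, `ec_required`, `blocks_maintenance`, and so on) is 
hypothetical, and the defaults are only the ones quoted in the bullets above. 
It assumes any state other than `CLOSED` or `QUASI_CLOSED` blocks the 
transition, which is broader than naming only `OPEN`.

```python
from dataclasses import dataclass

# States the proposed doc text names as acceptable (assumption: no others are).
STABLE_STATES = {"CLOSED", "QUASI_CLOSED"}


@dataclass
class ContainerView:
    """Hypothetical summary of one container hosted on the Datanode."""
    state: str               # e.g. "OPEN", "CLOSED", "QUASI_CLOSED"
    replicas_elsewhere: int  # healthy replicas/shards on other Datanodes
    required_elsewhere: int  # threshold that must remain online


def ratis_required(maintenance_replica_minimum: int = 2) -> int:
    # Default of 2 is taken from the doc text for
    # hdds.scm.replication.maintenance.replica.minimum.
    return maintenance_replica_minimum


def ec_required(data_shards: int, remaining_redundancy: int = 1) -> int:
    # "Data Shards + 1" by default, per the doc text for
    # hdds.scm.replication.maintenance.remaining.redundancy.
    return data_shards + remaining_redundancy


def blocks_maintenance(c: ContainerView) -> bool:
    """True if this container would keep the node in ENTERING_MAINTENANCE."""
    not_stable = c.state not in STABLE_STATES
    under_replicated = c.replicas_elsewhere < c.required_elsewhere
    return not_stable or under_replicated
```

   With this reading, a `QUASI_CLOSED` container that has enough replicas 
elsewhere does not block the transition, which is why "not closed" is 
misleading in the current sentence. For RS(6,3), `ec_required(6)` gives 7, 
matching the example in the EC bullet.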



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

