sreejasahithi commented on code in PR #380: URL: https://github.com/apache/ozone-site/pull/380#discussion_r3086122990
##########
docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/02-datanode-maintenance.md:
##########
@@ -11,9 +11,20 @@ While in maintenance mode, a Datanode does not accept new writes but may still s
 The Datanode transitions through the following operational states during maintenance:
 1. **IN_SERVICE**: The Datanode is fully operational and participating in data writes and reads.
-2. **ENTERING_MAINTENANCE**: The Datanode is transitioning into maintenance mode. New writes will be avoided.
+2. **ENTERING_MAINTENANCE**: The Datanode is transitioning into maintenance mode. New writes will be avoided. The SCM monitors the Datanode until it meets all safety criteria before allowing it to fully enter maintenance.
 3. **IN_MAINTENANCE**: The Datanode is in maintenance mode. Data will not be written to it. If the Datanode remains in this state beyond the configured maintenance window, its data will start to be replicated to other Datanodes to ensure data durability.
+### Transition Criteria (ENTERING_MAINTENANCE to IN_MAINTENANCE)
+
+A Datanode will remain in the `ENTERING_MAINTENANCE` state until the SCM (Storage Container Manager) verifies the following safety conditions:
+
+* **Pipeline Closure**: All open Ratis and EC pipelines on the Datanode must be successfully closed. This ensures no active write operations are interrupted.
+* **Datanode Acknowledgment**: The Datanode must confirm it has received the maintenance command and persisted the "ENTERING_MAINTENANCE" state to its local disk. This prevents state confusion if the Datanode is rebooted.
+* **Sufficient Replication (Data Safety)**: The SCM verifies that every container stored on the Datanode has enough healthy copies elsewhere in the cluster to remain safe while the node is offline.
+ * **Ratis (3-way)**: By default, at least 2 replicas must remain online on other healthy Datanodes (configurable via `hdds.scm.replication.maintenance.replica.minimum`).
+ * **Erasure Coding (EC)**: By default, the cluster must maintain at least `Data Shards + 1` available shards elsewhere (configurable via `hdds.scm.replication.maintenance.remaining.redundancy`). For example, in an RS(6,3) policy, at least 7 shards must be online.
+ * **Health Check**: Every container on the node must be in a stable state such as `CLOSED` or `QUASI_CLOSED`. If a container is under-replicated or not closed, the SCM will block the transition and trigger background replication to create new copies on other nodes until the safety threshold is met.
+

Review Comment:
   Here, "If a container is under-replicated or not closed" appears to contradict the previous sentence in this bullet, which says that CLOSED and QUASI_CLOSED are both accepted. We could instead change it to: "If a container is under-replicated or in OPEN state, the SCM will block the transition..."
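
   To make the suggested wording concrete, here is a minimal sketch of the checks as the patch text and the proposed rewording describe them. It is only an illustration, not Ozone's SCM code: the function names and parameters are hypothetical, the defaults (2 replicas for Ratis, 1 extra shard for EC) are the ones quoted in the patch, and how states other than OPEN, CLOSED, and QUASI_CLOSED are treated is not covered.

```python
# Illustrative sketch only. All names here are hypothetical and do not
# mirror Ozone's SCM implementation; defaults are taken from the patch text.

def ratis_has_enough_replicas(online_replicas_elsewhere: int,
                              replica_minimum: int = 2) -> bool:
    """Ratis: at least `replica_minimum` replicas online on other Datanodes."""
    return online_replicas_elsewhere >= replica_minimum


def ec_has_enough_shards(data_shards: int,
                         online_shards_elsewhere: int,
                         remaining_redundancy: int = 1) -> bool:
    """EC: at least data_shards + remaining_redundancy shards online elsewhere."""
    return online_shards_elsewhere >= data_shards + remaining_redundancy


def blocks_transition(container_state: str, sufficiently_replicated: bool) -> bool:
    """Proposed wording: block if under-replicated or in OPEN state."""
    return container_state == "OPEN" or not sufficiently_replicated


# RS(6,3) example from the patch: 7 shards online elsewhere is enough, 6 is not.
rs63_with_7 = ec_has_enough_shards(data_shards=6, online_shards_elsewhere=7)
rs63_with_6 = ec_has_enough_shards(data_shards=6, online_shards_elsewhere=6)
```

   With this reading, a sufficiently replicated QUASI_CLOSED container does not block the transition, which is consistent with the first sentence of the Health Check bullet.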
