[
https://issues.apache.org/jira/browse/HDDS-14994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDDS-14994:
-----------------------------------
Status: Patch Available (was: Open)
> [Docs] Explain the criteria when a datanode is in maintenance mode
> ------------------------------------------------------------------
>
> Key: HDDS-14994
> URL: https://issues.apache.org/jira/browse/HDDS-14994
> Project: Apache Ozone
> Issue Type: Task
> Reporter: Wei-Chiu Chuang
> Priority: Major
> Labels: pull-request-available
>
> [https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]
>
> We talked about the different datanode states: *IN_SERVICE*,
> *ENTERING_MAINTENANCE* and *IN_MAINTENANCE*.
> But we did not explain when a datanode transitions from
> *ENTERING_MAINTENANCE* to *IN_MAINTENANCE*.
>
> The following are the full internal details. They need to be translated
> into a simple user-facing description.
> ---
> The Transitionary Phase (ENTERING_MAINTENANCE)
> Once a node is in the trackedNodes set, the processTransitioningNodes()
> method checks its status on every "tick":
> - How it checks: It calls nodeManager.getNodeStatus(dn), which returns a
>   NodeStatus object containing the NodeOperationalState.
> - Logic: If status.isEnteringMaintenance() is true, the monitor performs
>   several checks before allowing the node to fully enter maintenance:
>   1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC
>      pipelines on the DN are closed.
>   2. DN Persistence: It checks status.getOperationalState() ==
>      dn.getPersistedOpState(). This ensures the Datanode has actually
>      received the command and persisted the "Entering Maintenance" state.
>   3. Container Replication: checkContainersReplicatedOnNode checks whether
>      all containers on the node are sufficiently replicated elsewhere so
>      the node can safely go offline.
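
Read as a single gate, the three checks above might look like the following sketch. It is a stand-alone simplification and not the Ozone code: `NodeSnapshot`, its fields, `OpState`, and `nextState` are hypothetical names invented here, standing in for the results of checkPipelinesClosedOnNode, the persisted-state comparison, and checkContainersReplicatedOnNode. The real monitor may evaluate the checks in a different order or stop at the first failure.

```java
// Simplified sketch; NOT the actual Ozone monitor code.
// Every type, field, and method name here is a hypothetical stand-in.
final class MaintenanceGateSketch {
    /** Returns the state the node should hold after one monitor tick. */
    static OpState nextState(NodeSnapshot node) {
        if (node.scmState != OpState.ENTERING_MAINTENANCE) {
            return node.scmState; // only transitioning nodes are evaluated
        }
        boolean pipelinesClosed = node.openPipelines == 0;             // check 1
        boolean statePersisted = node.persistedState == node.scmState; // check 2
        boolean containersSafe = node.blockingContainers == 0;         // check 3
        return (pipelinesClosed && statePersisted && containersSafe)
                ? OpState.IN_MAINTENANCE
                : OpState.ENTERING_MAINTENANCE;
    }

    public static void main(String[] args) {
        NodeSnapshot waiting = new NodeSnapshot(
                OpState.ENTERING_MAINTENANCE, OpState.ENTERING_MAINTENANCE, 0, 2);
        NodeSnapshot ready = new NodeSnapshot(
                OpState.ENTERING_MAINTENANCE, OpState.ENTERING_MAINTENANCE, 0, 0);
        System.out.println(nextState(waiting)); // ENTERING_MAINTENANCE
        System.out.println(nextState(ready));   // IN_MAINTENANCE
    }
}

final class NodeSnapshot {
    final OpState scmState;        // operational state SCM holds for the node
    final OpState persistedState;  // state the datanode reports it has persisted
    final int openPipelines;       // pipelines on the node that are not yet closed
    final int blockingContainers;  // under-replicated plus unclosed containers

    NodeSnapshot(OpState scmState, OpState persistedState,
                 int openPipelines, int blockingContainers) {
        this.scmState = scmState;
        this.persistedState = persistedState;
        this.openPipelines = openPipelines;
        this.blockingContainers = blockingContainers;
    }
}

enum OpState { IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE }
```

In this sketch a node stays in ENTERING_MAINTENANCE for as long as any one of the three conditions is false, which matches the "checked on every tick" behaviour described above.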
> "sufficiently replicated elsewhere"
> This means the monitor has verified that every piece of data (container)
> stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
> Specifically, it checks:
> * Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This
>   ensures the container is in a healthy state (CLOSED or QUASI_CLOSED) and
>   would not be endangered if this node disappeared.
> * Replication Count: It asks the ReplicationManager whether the container
>   is UNDER_REPLICATED.
>   * If a container is supposed to have 3 replicas and it has 3, it is
>     "safe" (the node can go into maintenance because 2 replicas will remain).
>   * If it only has 2 replicas, it is "Under Replicated." The monitor
>     will block the maintenance transition until the cluster creates a new
>     copy on a different node.
> * Wait until zero: The node cannot enter maintenance until the number of
>   underReplicated and unclosed containers on that specific node reaches zero.
>
> ---
> The transition to maintenance mode should go through when
> underReplicated == 0 && unclosed == 0.
> For EC containers, a container counts as unclosed if it is unhealthy, that
> is, neither CLOSED nor QUASI_CLOSED.
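
The "wait until zero" rule and the 3-replica example from the description can be sketched as a per-node count. This is a hedged illustration and not the Ozone implementation: `ContainerView`, `ContainerState`, and `canEnterMaintenance` are hypothetical names, the state list is deliberately shortened, and the sketch applies the "neither CLOSED nor QUASI_CLOSED" rule to every container even though the description states it only for EC containers.

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch; NOT the actual Ozone code. All names are hypothetical.
final class ReplicationCheckSketch {
    /** True when no container on the node is under-replicated or unclosed. */
    static boolean canEnterMaintenance(List<ContainerView> containersOnNode) {
        int underReplicated = 0;
        int unclosed = 0;
        for (ContainerView c : containersOnNode) {
            if (c.isUnclosed()) {
                unclosed++;
            } else if (c.isUnderReplicated()) {
                underReplicated++;
            }
        }
        return underReplicated == 0 && unclosed == 0;
    }

    public static void main(String[] args) {
        ContainerView full = new ContainerView(ContainerState.CLOSED, 3, 3);
        ContainerView missingOne = new ContainerView(ContainerState.CLOSED, 3, 2);
        ContainerView stillOpen = new ContainerView(ContainerState.OPEN, 3, 3);
        System.out.println(canEnterMaintenance(Arrays.asList(full)));             // true
        System.out.println(canEnterMaintenance(Arrays.asList(full, missingOne))); // false
        System.out.println(canEnterMaintenance(Arrays.asList(full, stillOpen)));  // false
    }
}

final class ContainerView {
    final ContainerState state;
    final int requiredReplicas; // replicas the container is supposed to have
    final int healthyReplicas;  // healthy replicas currently in the cluster

    ContainerView(ContainerState state, int requiredReplicas, int healthyReplicas) {
        this.state = state;
        this.requiredReplicas = requiredReplicas;
        this.healthyReplicas = healthyReplicas;
    }

    // Assumption: "unclosed" means neither CLOSED nor QUASI_CLOSED.
    boolean isUnclosed() {
        return state != ContainerState.CLOSED && state != ContainerState.QUASI_CLOSED;
    }

    boolean isUnderReplicated() {
        return healthyReplicas < requiredReplicas;
    }
}

enum ContainerState { OPEN, CLOSING, QUASI_CLOSED, CLOSED }
```

A single container with 2 of 3 replicas, or one that is still open, is enough to keep the node in ENTERING_MAINTENANCE in this sketch.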