[ 
https://issues.apache.org/jira/browse/HDDS-14994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDDS-14994:
--------------------------------------

    Assignee: Wei-Chiu Chuang

> [Docs] Explain the criteria when a datanode is in maintenance mode
> ------------------------------------------------------------------
>
>                 Key: HDDS-14994
>                 URL: https://issues.apache.org/jira/browse/HDDS-14994
>             Project: Apache Ozone
>          Issue Type: Task
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>              Labels: pull-request-available
>
> [https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]
>  
> We describe the different datanode states: *IN_SERVICE*,
> *ENTERING_MAINTENANCE* and *IN_MAINTENANCE*.
> But we did not explain when a datanode transitions from
> *ENTERING_MAINTENANCE* to *IN_MAINTENANCE*.
>  
> The full internal details follow. They need to be translated into a
> simple, user-facing description.
> ---
> The Transitionary Phase (ENTERING_MAINTENANCE)
>   Once a node is in the trackedNodes set, the processTransitioningNodes()
>   method checks its status on every "tick":
>    - How it checks: It calls nodeManager.getNodeStatus(dn), which returns a
>      NodeStatus object containing the NodeOperationalState.
>    - Logic: If status.isEnteringMaintenance() is true, the monitor performs
>      several checks before allowing it to fully enter maintenance:
>        1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC
>           pipelines on the DN are closed.
>        2. DN Persistence: It checks status.getOperationalState() ==
>           dn.getPersistedOpState(). This ensures the Datanode has actually
>           received the command and persisted the "Entering Maintenance" state.
>        3. Container Replication: checkContainersReplicatedOnNode checks if
>           all containers on the node are sufficiently replicated elsewhere so
>           the node can safely go offline.
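The three checks above can be read as a single gate. The Java sketch below models that gate under simplifying assumptions: `OpState`, `NodeSnapshot` and `nextState` are hypothetical names made up for illustration, not Ozone classes, and the real monitor queries the NodeManager and ReplicationManager instead of reading plain fields.

```java
public class MaintenanceGateSketch {
    /** Simplified operational states; only the three discussed here. */
    enum OpState { IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE }

    /** Hypothetical per-tick view of one datanode (not an Ozone class). */
    static final class NodeSnapshot {
        final OpState scmState;             // state recorded on the SCM side
        final OpState persistedState;       // state the datanode reports as persisted
        final boolean pipelinesClosed;      // result of check 1
        final boolean containersReplicated; // result of check 3

        NodeSnapshot(OpState scmState, OpState persistedState,
                     boolean pipelinesClosed, boolean containersReplicated) {
            this.scmState = scmState;
            this.persistedState = persistedState;
            this.pipelinesClosed = pipelinesClosed;
            this.containersReplicated = containersReplicated;
        }
    }

    /** Returns the state the node should be in after this tick. */
    static OpState nextState(NodeSnapshot n) {
        if (n.scmState != OpState.ENTERING_MAINTENANCE) {
            return n.scmState; // this sketch only models the maintenance gate
        }
        boolean ready = n.pipelinesClosed                // 1. pipeline closure
                && n.scmState == n.persistedState        // 2. DN persistence
                && n.containersReplicated;               // 3. container replication
        return ready ? OpState.IN_MAINTENANCE : OpState.ENTERING_MAINTENANCE;
    }

    public static void main(String[] args) {
        // The datanode has not yet persisted the new state, so it keeps waiting.
        NodeSnapshot waiting = new NodeSnapshot(
                OpState.ENTERING_MAINTENANCE, OpState.IN_SERVICE, true, true);
        // All three checks pass, so the node moves on.
        NodeSnapshot ready = new NodeSnapshot(
                OpState.ENTERING_MAINTENANCE, OpState.ENTERING_MAINTENANCE, true, true);
        System.out.println(nextState(waiting));
        System.out.println(nextState(ready));
    }
}
```

A node that fails any one check simply stays in ENTERING_MAINTENANCE and is evaluated again on the next tick.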
> "Sufficiently replicated elsewhere"
>   This means the monitor has verified that every piece of data (container)
>   stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
>   Specifically, it checks:
>    * Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). A
>      container that fails this check counts as "unclosed": it is not yet in a
>      CLOSED or QUASI_CLOSED state and would be endangered if this node
>      disappeared.
>    * Replication Count: It asks the ReplicationManager if the container is
>      UNDER_REPLICATED.
>        * If a container is supposed to have 3 replicas and it has 3, it is
>          "safe" (the node can go into maintenance because 2 replicas will
>          remain).
>        * If it only has 2 replicas, it is "Under Replicated." The monitor
>          will block the maintenance transition until the cluster creates a
>          new copy on a different node.
>    * Wait until zero: The node cannot enter maintenance until the number of
>      underReplicated and unclosed containers on that specific node reaches
>      zero.
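The 3-replica example above can be worked through in a few lines of Java. This is a simplified sketch, not the ReplicationManager's logic: `isUnderReplicated` and `remainingDuringMaintenance` are hypothetical helpers, and the replica count is assumed to include the copy held by the node that is entering maintenance.

```java
public class ReplicaCountSketch {
    /** A container is under-replicated when it has fewer healthy replicas than expected. */
    static boolean isUnderReplicated(int expectedReplicas, int healthyReplicas) {
        return healthyReplicas < expectedReplicas;
    }

    /** Replicas still online once the node holding one of them goes into maintenance. */
    static int remainingDuringMaintenance(int healthyReplicas) {
        return healthyReplicas - 1;
    }

    public static void main(String[] args) {
        // 3 of 3 replicas: not under-replicated, and 2 copies remain online.
        System.out.println(isUnderReplicated(3, 3));        // false
        System.out.println(remainingDuringMaintenance(3));  // 2
        // 2 of 3 replicas: under-replicated, so the transition is blocked.
        System.out.println(isUnderReplicated(3, 2));        // true
    }
}
```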
>  
> ---
> The transition to maintenance mode should go through once
> under replicated == 0 && unclosed == 0.
> An EC container counts as unclosed if it is unhealthy, that is, neither
> CLOSED nor QUASI_CLOSED.
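As a rough illustration of the final condition, the Java sketch below tallies the two counters over the containers on one node. `State`, `Container` and `canEnterMaintenance` are hypothetical names, the state list is simplified, and the under-replication flag is taken as given rather than computed the way the ReplicationManager would.

```java
import java.util.Arrays;
import java.util.List;

public class ContainerTallySketch {
    /** Simplified container states for illustration. */
    enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

    /** Hypothetical view of one container hosted on the node. */
    static final class Container {
        final State state;
        final boolean underReplicated;

        Container(State state, boolean underReplicated) {
            this.state = state;
            this.underReplicated = underReplicated;
        }
    }

    /** Unclosed means neither CLOSED nor QUASI_CLOSED. */
    static boolean isUnclosed(Container c) {
        return c.state != State.CLOSED && c.state != State.QUASI_CLOSED;
    }

    /** The node may proceed only when both counters are zero. */
    static boolean canEnterMaintenance(List<Container> containers) {
        int underReplicated = 0;
        int unclosed = 0;
        for (Container c : containers) {
            if (isUnclosed(c)) {
                unclosed++;
            }
            if (c.underReplicated) {
                underReplicated++;
            }
        }
        return underReplicated == 0 && unclosed == 0;
    }

    public static void main(String[] args) {
        List<Container> blocked = Arrays.asList(
                new Container(State.CLOSED, false),
                new Container(State.OPEN, false));   // one unclosed container
        List<Container> clear = Arrays.asList(
                new Container(State.CLOSED, false),
                new Container(State.QUASI_CLOSED, false));
        System.out.println(canEnterMaintenance(blocked));
        System.out.println(canEnterMaintenance(clear));
    }
}
```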



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
