[
https://issues.apache.org/jira/browse/HDDS-14994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDDS-14994:
-----------------------------------
Description:
[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]
We talked about the different datanode states: *IN_SERVICE*, *ENTERING_MAINTENANCE*,
and *IN_MAINTENANCE*.
But we did not explain when a datanode transitions from *ENTERING_MAINTENANCE*
to *IN_MAINTENANCE*.
The following are the full internal details. They need to be translated into a
simple user-facing description.
---
The Transitionary Phase (ENTERING_MAINTENANCE)
Once a node is in the trackedNodes set, the processTransitioningNodes()
method checks its status on every "tick":
- How it checks: It calls nodeManager.getNodeStatus(dn), which returns a
NodeStatus object containing the NodeOperationalState.
- Logic: If status.isEnteringMaintenance() is true, the monitor performs
several checks before allowing it to fully enter maintenance:
1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC
pipelines on the DN are closed.
2. DN Persistence: It checks status.getOperationalState() ==
dn.getPersistedOpState(). This ensures the Datanode has actually received the
command and persisted the "Entering Maintenance" state.
3. Container Replication: checkContainersReplicatedOnNode checks if all
containers on the node are sufficiently replicated elsewhere so the node can
safely go offline.
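The three checks above can be read as a single gate that is re-evaluated on every tick. The following is only a minimal sketch of that idea, not the actual monitor code: the class MaintenanceGateSketch, the OpState enum, and the boolean parameters are all hypothetical stand-ins for what checkPipelinesClosedOnNode, the persisted-state comparison, and checkContainersReplicatedOnNode would report.

```java
/** Hypothetical, simplified model of the three checks; not Ozone's actual classes. */
final class MaintenanceGateSketch {

  /** Subset of the operational states named in the description. */
  enum OpState { IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE }

  /**
   * Returns the state the node should be in after one monitor "tick".
   * All parameters are stand-ins for what the real checks would report.
   */
  static OpState nextState(OpState scmState,
                           OpState persistedOnDatanode,
                           boolean pipelinesClosed,
                           boolean containersReplicated) {
    if (scmState != OpState.ENTERING_MAINTENANCE) {
      return scmState; // only transitioning nodes are evaluated
    }
    // Check 2: the datanode has persisted the state SCM expects.
    boolean statePersisted = scmState == persistedOnDatanode;
    // The node moves on only when checks 1, 2 and 3 all pass.
    if (pipelinesClosed && statePersisted && containersReplicated) {
      return OpState.IN_MAINTENANCE;
    }
    return OpState.ENTERING_MAINTENANCE; // try again on the next tick
  }

  public static void main(String[] args) {
    // Datanode has not yet persisted the new state: stays ENTERING_MAINTENANCE.
    System.out.println(nextState(OpState.ENTERING_MAINTENANCE,
        OpState.IN_SERVICE, true, true));
    // All three checks pass: moves to IN_MAINTENANCE.
    System.out.println(nextState(OpState.ENTERING_MAINTENANCE,
        OpState.ENTERING_MAINTENANCE, true, true));
  }
}
```

In this sketch a failing check never moves the node backwards; it simply leaves the node in ENTERING_MAINTENANCE until a later tick.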
"sufficiently replicated elsewhere"
This means the monitor has verified that every piece of data (container)
stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
Specifically, it checks:
* Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This
ensures the container is in a CLOSED or QUASI_CLOSED state; a container in any
other state would be endangered if this node disappeared.
* Replication Count: It asks the ReplicationManager if the container is
UNDER_REPLICATED.
* If a container is supposed to have 3 replicas and it has 3, it is
"safe" (the node can go into maintenance because 2 replicas will remain).
* If it only has 2 replicas, it is "Under Replicated." The monitor will
block the maintenance transition until the cluster creates a new copy on a
different node.
* Wait until zero: The node cannot enter maintenance until the number of
underReplicated and unclosed containers on that specific node reaches zero.
---
The transition to maintenance mode should go through once underReplicated == 0
&& unclosed == 0.
An EC container counts as unclosed if it is unhealthy, that is, neither CLOSED
nor QUASI_CLOSED.
was:
[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-maintenance]
We talked about the different datanode states: *IN_SERVICE*, *ENTERING_MAINTENANCE*,
and *IN_MAINTENANCE*.
But we did not explain when a datanode transitions from *ENTERING_MAINTENANCE*
to *IN_MAINTENANCE*.
The following are the full internal details. They need to be translated into a
simple user-facing description.
---
The Transitionary Phase (ENTERING_MAINTENANCE)
Once a node is in the trackedNodes set, the processTransitioningNodes()
method checks its status on every "tick":
- How it checks: It calls nodeManager.getNodeStatus(dn), which returns a
NodeStatus object containing the NodeOperationalState.
- Logic: If status.isEnteringMaintenance() is true, the monitor performs
several checks before allowing it to fully enter maintenance:
1. Pipeline Closure: checkPipelinesClosedOnNode ensures all Ratis/EC
pipelines on the DN are closed.
2. DN Persistence: It checks status.getOperationalState() ==
dn.getPersistedOpState(). This ensures the Datanode has actually received the
command and persisted the "Entering Maintenance" state.
3. Container Replication: checkContainersReplicatedOnNode checks if all
containers on the node are sufficiently replicated elsewhere so the node can
safely go offline.
"sufficiently replicated elsewhere"
This means the monitor has verified that every piece of data (container)
stored on that Datanode has enough "healthy" copies elsewhere in the cluster.
Specifically, it checks:
* Offline Health: It calls replicaSet.isHealthyEnoughForOffline(). This
ensures the container is in a CLOSED or QUASI_CLOSED state; a container in any
other state would be endangered if this node disappeared.
* Replication Count: It asks the ReplicationManager if the container is
UNDER_REPLICATED.
* If a container is supposed to have 3 replicas and it has 3, it is
"safe" (the node can go into maintenance because 2 replicas will remain).
* If it only has 2 replicas, it is "Under Replicated." The monitor will
block the maintenance transition until the cluster creates a new copy on a
different node.
* Wait until zero: The node cannot enter maintenance until the number of
underReplicated and unclosed containers on that specific node reaches zero.
> [Docs] Explain the criteria when a datanode is in maintenance mode
> ------------------------------------------------------------------
>
> Key: HDDS-14994
> URL: https://issues.apache.org/jira/browse/HDDS-14994
> Project: Apache Ozone
> Issue Type: Task
> Reporter: Wei-Chiu Chuang
> Priority: Major
>