[
https://issues.apache.org/jira/browse/HDDS-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705292#comment-17705292
]
Stephen O'Donnell commented on HDDS-8179:
-----------------------------------------
Decommission decides whether a container is healthy using the following
logic, i.e. the container must be CLOSED or QUASI_CLOSED and all IN_SERVICE
replicas must be in a matching state:
{code}
default boolean isHealthy() {
  HddsProtos.LifeCycleState containerState = getContainer().getState();
  return (containerState == HddsProtos.LifeCycleState.CLOSED
      || containerState == HddsProtos.LifeCycleState.QUASI_CLOSED)
      && getReplicas().stream()
      .filter(r -> r.getDatanodeDetails().getPersistedOpState() == IN_SERVICE)
      .allMatch(r -> LegacyReplicationManager.compareState(
          containerState, r.getState()));
}
{code}
In this case, the container state is DELETING and all the replicas are CLOSED,
so the check fails and decommission will not move forward. If replica deletion
were not broken, the replicas for this container would be removed, the
container would move to DELETED, and decommission would complete.
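
To make the failure mode concrete, here is a minimal standalone sketch of the
same check. It is not the Ozone code: the State enum, the Replica class and the
direct state comparison are simplified stand-ins (assumptions) for
HddsProtos.LifeCycleState, ContainerReplica and
LegacyReplicationManager.compareState.

```java
import java.util.List;

class DecommissionHealthSketch {
  // Hypothetical, simplified state enum shared by containers and replicas.
  enum State { OPEN, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  // Hypothetical stand-in for a replica: its state and whether its
  // datanode is IN_SERVICE.
  static final class Replica {
    final State state;
    final boolean inService;

    Replica(State state, boolean inService) {
      this.state = state;
      this.inService = inService;
    }
  }

  // Mirrors the shape of isHealthy(): the container must be CLOSED or
  // QUASI_CLOSED, and every IN_SERVICE replica must match its state.
  static boolean isHealthy(State containerState, List<Replica> replicas) {
    return (containerState == State.CLOSED
        || containerState == State.QUASI_CLOSED)
        && replicas.stream()
            .filter(r -> r.inService)
            .allMatch(r -> r.state == containerState);
  }

  public static void main(String[] args) {
    List<Replica> closedReplicas = List.of(
        new Replica(State.CLOSED, true),
        new Replica(State.CLOSED, true));

    // Container stuck in DELETING with CLOSED replicas: fails the first clause.
    System.out.println(isHealthy(State.DELETING, closedReplicas)); // false
    // Same replicas with a CLOSED container: passes.
    System.out.println(isHealthy(State.CLOSED, closedReplicas));   // true
  }
}
```

Because the container state is tested first, a DELETING container is reported
as unhealthy regardless of what state its replicas are in.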
The only change I think we need to make is to improve the log message to
include the container state, which is currently missing.
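
As a rough illustration of that log change, the sketch below builds an
"Unhealthy Container" line that also carries the container state. The helper
name, its parameters and the exact wording are hypothetical; the real message
in DatanodeAdminMonitorImpl may be formatted differently.

```java
class UnhealthyLogSketch {
  // Hypothetical helper: formats the log line with the container state added.
  static String unhealthyMessage(long containerId, String containerState,
      String replicaSummary) {
    return String.format("Unhealthy Container #%d (state=%s) %s",
        containerId, containerState, replicaSummary);
  }

  public static void main(String[] args) {
    // Prints: Unhealthy Container #15019 (state=DELETING) Replicas{...}
    System.out.println(unhealthyMessage(15019L, "DELETING", "Replicas{...}"));
  }
}
```

With the state in the message, a container stuck in DELETING can be spotted
from the SCM log alone.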
> Datanode decommissioning blocked due to unhealthy container
> -----------------------------------------------------------
>
> Key: HDDS-8179
> URL: https://issues.apache.org/jira/browse/HDDS-8179
> Project: Apache Ozone
> Issue Type: Bug
> Components: ECOfflineRecovery, SCM
> Reporter: Varsha Ravi
> Priority: Major
>
> The Replication Manager is sending a delete container command to a non-empty
> container due to HDDS-7775. The container is not deleted, but *subsequent
> decommissioning calls to any of the DNs do not complete* because the
> container is in an under-replicated as well as unhealthy state.
> *SCM.log:*
> {noformat}
> 2023-03-14 21:53:26,413 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 1, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> 1ca038f8-c505-47ca-b701-d542b85bb75b
> 2023-03-14 21:53:26,413 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 5, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> 1ac8e090-7eb7-4dab-93b7-97e4845f7b49
> 2023-03-14 23:19:12,206 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 3, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:19:53,296 INFO
> org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Starting
> Decommission for node c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Under Replicated
> Container #15019
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
> Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc,
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678,
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b,
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280,
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578,
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49,
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Unhealthy Container
> #15019
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
> Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc,
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678,
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b,
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280,
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578,
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49,
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> c5c3948e-1296-4313-8c4e-9e6e50424280 has 60 sufficientlyReplicated, 1
> underReplicated and 1 unhealthy containers{noformat}
> *DN.log:*
> {noformat}
> 2023-03-14 21:53:32,032 ERROR
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Received
> container deletion command for container 15019 but the container is not empty
> with blockCount 1
> 2023-03-14 21:53:32,035 ERROR
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler:
> Exception occurred while deleting the container.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Non-force deletion of non-empty container is not allowed.
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteInternal(KeyValueHandler.java:1303)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteContainer(KeyValueHandler.java:1160)
> at
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.deleteContainer(ContainerController.java:182)
> at
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.handleInternal(DeleteContainerCommandHandler.java:108)
> at
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.lambda$handle$0(DeleteContainerCommandHandler.java:78)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}