[
https://issues.apache.org/jira/browse/HDDS-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720902#comment-17720902
]
Stephen O'Donnell commented on HDDS-8179:
-----------------------------------------
Decommission will not get stuck due to an orphan block. The fix here is to make
sure the first ICR for a container replica carries a replica, and hence the
counter is non-zero. That will prevent the container getting into the DELETING
state on SCM, which it should not be in. Provided the container is not in
DELETING, the problem described here will not occur.
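A minimal sketch of that guard, assuming a hypothetical helper (the name canTransitionToDeleting and its parameters are illustrative, not Ozone's actual API): SCM should only trust blockCount == 0 once at least one replica has actually been reported, so "no report yet" is never mistaken for "empty".

```java
public class DeletingTransitionSketch {

  // Illustrative only: SCM may treat a container as empty (and eligible for
  // the DELETING state) only when at least one replica has been reported
  // AND the reported block count is zero. With zero replicas reported, a
  // blockCount of 0 just means no ICR has arrived yet.
  static boolean canTransitionToDeleting(int reportedReplicas, long blockCount) {
    return reportedReplicas > 0 && blockCount == 0;
  }

  public static void main(String[] args) {
    // No replicas reported yet: must not be treated as empty.
    System.out.println(canTransitionToDeleting(0, 0));  // false
    // Replicas reported and genuinely empty: DELETING is safe.
    System.out.println(canTransitionToDeleting(3, 0));  // true
    // Replicas reported but container holds blocks: never delete.
    System.out.println(canTransitionToDeleting(3, 1));  // false
  }
}
```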
There is also HDDS-8115, which proposes moving away from blockCount to an
explicit empty flag to trigger deletion. Even with that, I think we need
replicaIndex=1 or the parity replicas to have empty=true before triggering
deletion. Any other replica could be empty while the full container group is
not empty.
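The EC rule above could be sketched as follows (canTriggerDelete, emptyByIndex and dataNum are hypothetical names for illustration, not the HDDS-8115 patch): in an EC group with dataNum data replicas, only replicaIndex 1 and the parity indices (those above dataNum) reliably hold data whenever the group holds any data, so only their empty flags should be allowed to trigger deletion.

```java
import java.util.Map;

public class EcEmptyCheckSketch {

  // Illustrative only: for an EC container group with `dataNum` data
  // replicas, a data replica with index > 1 can legitimately be empty even
  // when the group holds data (the data may not have reached its stripe
  // position). Only replicaIndex 1 and the parity replicas (index > dataNum)
  // are authoritative, so only their empty=true flags may trigger deletion.
  static boolean canTriggerDelete(Map<Integer, Boolean> emptyByIndex, int dataNum) {
    for (Map.Entry<Integer, Boolean> e : emptyByIndex.entrySet()) {
      int idx = e.getKey();
      boolean authoritative = idx == 1 || idx > dataNum;
      if (authoritative && e.getValue()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // e.g. RS-3-2: dataNum = 3, parity indices are 4 and 5.
    // Index 2 empty, but index 1 is not: not a delete trigger.
    System.out.println(canTriggerDelete(Map.of(1, false, 2, true), 3));  // false
    // Index 1 reports empty: safe to trigger deletion.
    System.out.println(canTriggerDelete(Map.of(1, true, 2, false), 3));  // true
    // A parity replica (index 4) reports empty: also a valid trigger.
    System.out.println(canTriggerDelete(Map.of(4, true), 3));            // true
  }
}
```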
> Datanode decommissioning blocked due to non-empty replica of deleting
> container
> -------------------------------------------------------------------------------
>
> Key: HDDS-8179
> URL: https://issues.apache.org/jira/browse/HDDS-8179
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: ECOfflineRecovery, SCM
> Reporter: Varsha Ravi
> Assignee: Siddhant Sangwan
> Priority: Major
> Labels: pull-request-available
>
> The Replication Manager is sending a delete container command to a non-empty
> container due to HDDS-7775. The container is not deleted, but *subsequent
> decommissioning calls to any of the DNs do not complete* because the
> container is in an under-replicated as well as an unhealthy state.
> *SCM.log:*
> {noformat}
> 2023-03-14 21:53:26,413 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 1, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> 1ca038f8-c505-47ca-b701-d542b85bb75b
> 2023-03-14 21:53:26,413 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 5, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> 1ac8e090-7eb7-4dab-93b7-97e4845f7b49
> 2023-03-14 23:19:12,206 INFO
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 3, force:
> false] for container ContainerInfo{id=#15019, state=DELETING,
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9,
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to
> c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:19:53,296 INFO
> org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Starting
> Decommission for node c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Under Replicated
> Container #15019
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
> Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc,
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678,
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b,
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280,
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578,
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49,
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Unhealthy Container
> #15019
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
> Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc,
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678,
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b,
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280,
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578,
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED,
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49,
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1,
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> c5c3948e-1296-4313-8c4e-9e6e50424280 has 60 sufficientlyReplicated, 1
> underReplicated and 1 unhealthy containers{noformat}
> *DN.log:*
> {noformat}
> 2023-03-14 21:53:32,032 ERROR
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Received
> container deletion command for container 15019 but the container is not empty
> with blockCount 1
> 2023-03-14 21:53:32,035 ERROR
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler:
> Exception occurred while deleting the container.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> Non-force deletion of non-empty container is not allowed.
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteInternal(KeyValueHandler.java:1303)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteContainer(KeyValueHandler.java:1160)
> at
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.deleteContainer(ContainerController.java:182)
> at
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.handleInternal(DeleteContainerCommandHandler.java:108)
> at
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.lambda$handle$0(DeleteContainerCommandHandler.java:78)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)