[ https://issues.apache.org/jira/browse/HDDS-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705304#comment-17705304 ]

Stephen O'Donnell commented on HDDS-8179:
-----------------------------------------

Now that I think about this further, the container appears to be in the DELETING 
state, but the replica is non-empty. So even with HDDS-7775, the 
ClosedWithUnhealthyReplicasHandler will force delete, but the 
EmptyContainerHandler will not. So if SCM believes the container is empty, but 
one or more replicas still have blocks which make them non-empty, the delete 
will be stuck forever. This in turn affects decommission, as it expects all the 
containers to end up closed or deleted before it can continue.
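
To make the mismatch concrete, here is a minimal sketch of why the non-force 
delete loops forever (the names and checks below are simplified assumptions, 
not the actual handler code):

{noformat}
// Minimal sketch of the SCM vs DN disagreement described above; the method
// names and checks are simplified assumptions, not the real Ozone handlers.
public class StuckDeleteSketch {

  // SCM side: an EmptyContainerHandler-style view based on SCM's own metadata.
  static boolean scmThinksEmpty(long scmKeyCount, long scmBytesUsed) {
    return scmKeyCount == 0 && scmBytesUsed == 0;
  }

  // DN side: a KeyValueHandler-style check based on the replica's local block count.
  static boolean dnAcceptsDelete(long localBlockCount, boolean force) {
    return force || localBlockCount == 0;
  }

  public static void main(String[] args) {
    boolean force = false;        // the empty-container path never sets force
    long localBlockCount = 1;     // but the replica still holds one block
    if (scmThinksEmpty(0, 0) && !dnAcceptsDelete(localBlockCount, force)) {
      // SCM keeps resending the non-force delete, the DN keeps rejecting it,
      // and the DELETING container blocks decommission indefinitely.
      System.out.println("delete is stuck");
    }
  }
}
{noformat}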

This seems to be an issue with the empty container handling plus the delete 
flow. I am not sure what we should do here. For example, what if one replica 
has some garbage data or an undeleted block, but the rest of the replicas have 
been removed already? For Ratis, the container would still be readable, but for 
EC it would not, so the remaining replica is somewhat useless.
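
To put rough numbers on that last point (the rs-3-2 style scheme and counts 
below are assumptions for the example, not taken from this cluster):

{noformat}
// Rough illustration of the Ratis vs EC difference for a single leftover replica.
public class LeftoverReplicaSketch {
  public static void main(String[] args) {
    int ecDataFragments = 3;   // EC needs this many distinct indexes to reconstruct data
    int survivingIndexes = 1;  // only one replica index was left undeleted
    System.out.println("EC readable:    " + (survivingIndexes >= ecDataFragments)); // false
    System.out.println("Ratis readable: " + (survivingIndexes >= 1));               // true, any replica is a full copy
  }
}
{noformat}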

> Add toString() method to ECContainerReplicaCount
> ------------------------------------------------
>
>                 Key: HDDS-8179
>                 URL: https://issues.apache.org/jira/browse/HDDS-8179
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: ECOfflineRecovery, SCM
>            Reporter: Varsha Ravi
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>
> The Replication Manager is sending a delete container command for a non-empty 
> container due to HDDS-7775. The container is not deleted, but the *subsequent 
> decommissioning calls to any of the DNs are not completing* because the 
> container is in an under-replicated as well as an unhealthy state.
> *SCM.log:*
> {noformat}
> 2023-03-14 21:53:26,413 INFO 
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending 
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 1, force: 
> false] for container ContainerInfo{id=#15019, state=DELETING, 
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9, 
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to 
> 1ca038f8-c505-47ca-b701-d542b85bb75b
> 2023-03-14 21:53:26,413 INFO 
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending 
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 5, force: 
> false] for container ContainerInfo{id=#15019, state=DELETING, 
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9, 
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to 
> 1ac8e090-7eb7-4dab-93b7-97e4845f7b49
> 2023-03-14 23:19:12,206 INFO 
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending 
> command [deleteContainerCommand: containerID: 15019, replicaIndex: 3, force: 
> false] for container ContainerInfo{id=#15019, state=DELETING, 
> pipelineID=PipelineID=e3fb8629-89ee-472a-9c43-3962629bd7a9, 
> stateEnterTime=2023-03-14T19:17:07.315Z, owner=om2} to 
> c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:19:53,296 INFO 
> org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Starting 
> Decommission for node c5c3948e-1296-4313-8c4e-9e6e50424280
> 2023-03-14 23:22:38,512 INFO 
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Under Replicated 
> Container #15019 
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
>  Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc, 
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678, 
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b, 
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280, 
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578, 
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, 
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO 
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Unhealthy Container 
> #15019 
> org.apache.hadoop.hdds.scm.container.replication.ECContainerReplicaCount@2bd10f2f;
>  Replicas{
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=ba62c66a-a342-4147-8344-3ce91726c2dc, 
> placeOfBirth=ba62c66a-a342-4147-8344-3ce91726c2dc, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=5},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=15af7526-8376-45c4-97a5-7a74b7abc678, 
> placeOfBirth=15af7526-8376-45c4-97a5-7a74b7abc678, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=4},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=1ca038f8-c505-47ca-b701-d542b85bb75b, 
> placeOfBirth=1ca038f8-c505-47ca-b701-d542b85bb75b, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=c5c3948e-1296-4313-8c4e-9e6e50424280, 
> placeOfBirth=c5c3948e-1296-4313-8c4e-9e6e50424280, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=3},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=f689fc55-e0e3-4785-9f2a-f799e18f0578, 
> placeOfBirth=f689fc55-e0e3-4785-9f2a-f799e18f0578, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=1},
> ContainerReplica{containerID=#15019, state=CLOSED, 
> datanodeDetails=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, 
> placeOfBirth=1ac8e090-7eb7-4dab-93b7-97e4845f7b49, sequenceId=0, keyCount=1, 
> bytesUsed=102400,replicaIndex=5}}
> 2023-03-14 23:22:38,512 INFO 
> org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: 
> c5c3948e-1296-4313-8c4e-9e6e50424280 has 60 sufficientlyReplicated, 1 
> underReplicated and 1 unhealthy containers{noformat}
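> The "ECContainerReplicaCount@2bd10f2f" above is the default Object hash, which 
> is why a readable toString() is being requested. A minimal standalone sketch of 
> the kind of summary that would help (the field names and example values are 
> assumptions for illustration, not the actual ECContainerReplicaCount members):
> {noformat}
> import java.util.List;
>
> // Illustration only: a small class showing the style of toString() output
> // that would replace the default "ClassName@hash" form in the log above.
> class EcReplicaCountToStringExample {
>   private final long containerId;
>   private final List<Integer> missingIndexes;   // assumed field for the example
>
>   EcReplicaCountToStringExample(long containerId, List<Integer> missingIndexes) {
>     this.containerId = containerId;
>     this.missingIndexes = missingIndexes;
>   }
>
>   @Override
>   public String toString() {
>     return "ECContainerReplicaCount{containerId=" + containerId
>         + ", missingIndexes=" + missingIndexes + "}";
>   }
>
>   public static void main(String[] args) {
>     // Example values only.
>     System.out.println(new EcReplicaCountToStringExample(15019, List.of(2)));
>   }
> }
> {noformat}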
> *DN.log:*
> {noformat}
> 2023-03-14 21:53:32,032 ERROR 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Received 
> container deletion command for container 15019 but the container is not empty 
> with blockCount 1
> 2023-03-14 21:53:32,035 ERROR 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler:
>  Exception occurred while deleting the container.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  Non-force deletion of non-empty container is not allowed.
>     at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteInternal(KeyValueHandler.java:1303)
>     at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteContainer(KeyValueHandler.java:1160)
>     at 
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.deleteContainer(ContainerController.java:182)
>     at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.handleInternal(DeleteContainerCommandHandler.java:108)
>     at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.lambda$handle$0(DeleteContainerCommandHandler.java:78)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834){noformat}


