[ https://issues.apache.org/jira/browse/HDDS-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699130#comment-17699130 ]

Ethan Rose edited comment on HDDS-8141 at 3/10/23 10:34 PM:
------------------------------------------------------------

This may have been caused by HDDS-8129 in the following execution:
1. The block count of this container reached 0 before all of its blocks were 
actually deleted. This could happen if the block count was already too low 
when the container was closed, due to HDDS-8129.
2. This datanode sends a container report to SCM, so SCM learns that the block 
count of this replica is zero.
3. Other replicas also report a block count of 0, either correctly or 
incorrectly.
4. SCM sees that all replicas have 0 blocks and sends delete container commands.
5. This datanode processes more block deletions, making the block count for the 
container negative.
6. This datanode processes the container delete command, which fails with this 
exception [since the block count is 
negative|https://github.com/apache/ozone/blob/360a23c0a1b69f5e748bfe78d47773724625e428/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1236].
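The sequence above can be replayed with a toy model of one replica's counters. This is a minimal sketch, not Ozone code: `NegativeBlockCountReplay`, `ReplicaCounters`, `deleteBlock`, and `deleteContainer` are hypothetical names, the two fields stand in for the `#BLOCKCOUNT` metadata key and the block files in the chunks directory, `IllegalStateException` stands in for Ozone's `StorageContainerException`, and the guard (refuse a non-force delete whenever the recorded count is not exactly 0) is an assumption based on this comment rather than a copy of `KeyValueHandler`. The starting values are made up so that the replica ends at -6, like the inspector output in the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay of the failure sequence; hypothetical names, not Ozone code. */
public class NegativeBlockCountReplay {

    /** One replica's state on one datanode. */
    static final class ReplicaCounters {
        long blockCount;   // stands in for the #BLOCKCOUNT key in RocksDB
        int blocksOnDisk;  // block files actually present in the chunks directory

        ReplicaCounters(long recordedBlockCount, int blocksOnDisk) {
            this.blockCount = recordedBlockCount;
            this.blocksOnDisk = blocksOnDisk;
        }

        /** Deleting a block removes its file and decrements the counter, with no underflow check. */
        void deleteBlock() {
            blocksOnDisk--;
            blockCount--;
        }

        /** Assumed guard: a non-force delete is refused unless the recorded count is exactly 0. */
        void deleteContainer(boolean force) {
            if (!force && blockCount != 0) {
                throw new IllegalStateException(
                    "Non-force deletion of non-empty container is not allowed");
            }
        }
    }

    /** Replays steps 1 to 6 and returns what the replica looked like at each point. */
    static List<String> replay() {
        List<String> events = new ArrayList<>();
        // Step 1: the container was closed with a count 6 lower than the 10 blocks on disk,
        // so 4 deletions are enough to bring the count to 0.
        ReplicaCounters replica = new ReplicaCounters(4, 10);
        for (int i = 0; i < 4; i++) {
            replica.deleteBlock();
        }
        // Steps 2 to 4: this is the count the replica reports, so SCM decides to delete it.
        events.add("reported blockCount=" + replica.blockCount
            + ", blocksOnDisk=" + replica.blocksOnDisk);
        // Step 5: the remaining block deletions are still processed.
        for (int i = 0; i < 6; i++) {
            replica.deleteBlock();
        }
        events.add("after remaining deletes blockCount=" + replica.blockCount
            + ", blocksOnDisk=" + replica.blocksOnDisk);
        // Step 6: the container delete command fails the emptiness check.
        try {
            replica.deleteContainer(false);
            events.add("container deleted");
        } catch (IllegalStateException e) {
            events.add("rejected: " + e.getMessage());
        }
        return events;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

Running `main` prints one line per stage: the replica reports a count of 0 while 6 blocks remain on disk, then ends with a count of -6 and an empty chunks directory, and the non-force delete is rejected. Since nothing in this model re-derives the counter from what is on disk, every retry of the delete hits the same check.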


was (Author: erose):
This may have been caused by HDDS-8115 in the following execution:
1. The block count of this container reached 0 before all of its blocks were 
actually deleted. This could happen if the block count was already too low 
when the container was closed, due to HDDS-8115.
2. This datanode sends a container report to SCM, so SCM learns that the block 
count of this replica is zero.
3. Other replicas also report a block count of 0, either correctly or 
incorrectly.
4. SCM sees that all replicas have 0 blocks and sends delete container commands.
5. This datanode processes more block deletions, making the block count for the 
container negative.
6. This datanode processes the container delete command, which fails with this 
exception [since the block count is 
negative|https://github.com/apache/ozone/blob/360a23c0a1b69f5e748bfe78d47773724625e428/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1236].

> Exception "Non-force deletion of non-empty container is not allowed" in 
> datanode logs
> -------------------------------------------------------------------------------------
>
>                 Key: HDDS-8141
>                 URL: https://issues.apache.org/jira/browse/HDDS-8141
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Priority: Major
>
> This exception has been seen a few times in datanode logs:
> {code:java}
> 2023-02-16 14:57:11,330 ERROR org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Received container deletion command for container 54652 but the container is not empty.
> 2023-02-16 14:57:11,330 ERROR org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler: Exception occurred while deleting the container.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Non-force deletion of non-empty container is not allowed.
>       at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteInternal(KeyValueHandler.java:1133)
>       at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.deleteContainer(KeyValueHandler.java:1094)
>       at org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.deleteContainer(ContainerController.java:182)
>       at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.DeleteContainerCommandHandler.lambda$handle$0(DeleteContainerCommandHandler.java:75)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:750)
> {code}
> This is a defensive code path that checks the block count metadata in RocksDB 
> to determine whether the container is empty. It is not expected to be hit.
> The last delete block command for this container was logged about 5 minutes 
> before this message. When we checked the disks of a few containers where this 
> happened, no block files were present. Logs show that SCM retried the delete 
> but got the same result every time.
> Later on, the container inspector was run on this cluster, and it reported 
> that there was only one copy of this container in the whole cluster. That copy 
> had the following metadata:
> {code:java}
> {
>   "containerID": 54652,
>   "schemaVersion": "2",
>   "containerState": "CLOSED",
>   "currentDatanodeID": "a160b3e2-a450-446d-a75c-898241a1ff7a",
>   "originDatanodeID": "a160b3e2-a450-446d-a75c-898241a1ff7a",
>   "dBMetadata": {
>     "#BLOCKCOUNT": -6,
>     "#BYTESUSED": -1431232412,
>     "#PENDINGDELETEBLOCKCOUNT": 0,
>     "#delTX": 46312,
>     "#BCSID": 1548650
>   },
>   "aggregates": {
>     "blockCount": 0,
>     "usedBytes": 0,
>     "pendingDeleteBlocks": 0,
>     "pendingDeleteBytes": 0
>   },
>   "chunksDirectory": {
>     "path": "<disk mount path>/current/containerDir106/54652/chunks",
>     "present": true,
>     "fileCount": 0
>   },
>   "dBMetadataDeleteCount_minus_aggregatedDeleteCount": 0,
>   "correct": false,
>   "errors": [
>     {
>       "property": "dBMetadata.#BLOCKCOUNT",
>       "expected": 0,
>       "actual": -6,
>       "repaired": false
>     },
>     {
>       "property": "dBMetadata.#BYTESUSED",
>       "expected": 0,
>       "actual": -1431232412,
>       "repaired": false
>     }
>   ]
> }
> {code}
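Each entry in the `errors` list above pairs a RocksDB metadata counter with the value it should have given the container's actual contents. A minimal sketch of that kind of comparison, assuming the expected values are the ones in the `aggregates` section (consistent with the report above, where `#BLOCKCOUNT` and `#BYTESUSED` are expected to be 0) and that a missing key counts as 0; `MetadataCheck`, `Mismatch`, and `findMismatches` are hypothetical names, not the container inspector's actual API, and repair is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a metadata-vs-aggregates consistency check; hypothetical names, not the inspector's code. */
public class MetadataCheck {

    /** One inconsistent counter, shaped like an entry in the report's "errors" array. */
    static final class Mismatch {
        final String property;
        final long expected;
        final long actual;

        Mismatch(String property, long expected, long actual) {
            this.property = property;
            this.expected = expected;
            this.actual = actual;
        }

        @Override
        public String toString() {
            return property + ": expected " + expected + ", actual " + actual;
        }
    }

    /** Assumed pairing of each RocksDB metadata key with the aggregate it should equal. */
    static final Map<String, String> KEY_TO_AGGREGATE = new LinkedHashMap<>();

    static {
        KEY_TO_AGGREGATE.put("#BLOCKCOUNT", "blockCount");
        KEY_TO_AGGREGATE.put("#BYTESUSED", "usedBytes");
        KEY_TO_AGGREGATE.put("#PENDINGDELETEBLOCKCOUNT", "pendingDeleteBlocks");
    }

    /** Returns one Mismatch per metadata key whose stored value differs from its aggregate. */
    static List<Mismatch> findMismatches(Map<String, Long> dbMetadata, Map<String, Long> aggregates) {
        List<Mismatch> errors = new ArrayList<>();
        for (Map.Entry<String, String> pair : KEY_TO_AGGREGATE.entrySet()) {
            // Missing keys are treated as 0 in this sketch.
            long actual = dbMetadata.getOrDefault(pair.getKey(), 0L);
            long expected = aggregates.getOrDefault(pair.getValue(), 0L);
            if (actual != expected) {
                errors.add(new Mismatch("dBMetadata." + pair.getKey(), expected, actual));
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Values taken from the container 54652 report above.
        Map<String, Long> dbMetadata = Map.of(
            "#BLOCKCOUNT", -6L,
            "#BYTESUSED", -1431232412L,
            "#PENDINGDELETEBLOCKCOUNT", 0L);
        Map<String, Long> aggregates = Map.of(
            "blockCount", 0L,
            "usedBytes", 0L,
            "pendingDeleteBlocks", 0L);
        findMismatches(dbMetadata, aggregates).forEach(System.out::println);
    }
}
```

With the values from container 54652, the sketch reports the same two properties as the inspector, `dBMetadata.#BLOCKCOUNT` (expected 0, actual -6) and `dBMetadata.#BYTESUSED` (expected 0, actual -1431232412), while `#PENDINGDELETEBLOCKCOUNT` matches.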



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
