[
https://issues.apache.org/jira/browse/HDDS-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-7300:
---------------------------------
Labels: pull-request-available (was: )
> Conflict between full data scan and block deletion
> --------------------------------------------------
>
> Key: HDDS-7300
> URL: https://issues.apache.org/jira/browse/HDDS-7300
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Xu Shao Hong
> Assignee: Xu Shao Hong
> Priority: Major
> Labels: pull-request-available
>
> We enabled the full data scan and found that one container was marked as
> unhealthy because of a conflict between the full data scan and block
> deletion.
> The block deleting service first deletes the block file and then updates the
> DB, while the data scan first reads the DB and then checks that the blocks
> exist.
> If the scan reads a DB record and then finds that the block no longer exists
> on the FS, a `Missing chunk file` exception is thrown and the container is
> marked as unhealthy.
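
The ordering described above can be replayed deterministically. The following is a minimal sketch, not Ozone code: the class `ScanDeleteRace`, its two in-memory sets, and the method `replayInterleaving` are hypothetical stand-ins for the container DB, the chunk directory, and the two services. It only illustrates why "delete file, then update DB" on one side and "read DB, then check file" on the other can produce a false corruption report.

```java
import java.util.HashSet;
import java.util.Set;

class ScanDeleteRace {
    // Hypothetical stand-ins: block IDs recorded in the DB and block files on disk.
    static final Set<Long> dbRecords = new HashSet<>();
    static final Set<Long> filesOnDisk = new HashSet<>();

    /**
     * Replays the interleaving from the report on a single thread.
     * Returns true if the scan would report a missing chunk file.
     */
    public static boolean replayInterleaving(long blockId) {
        dbRecords.add(blockId);
        filesOnDisk.add(blockId);

        // Deleter, step 1: delete the block file.
        filesOnDisk.remove(blockId);

        // Scanner, step 1: read the DB; the record has not been removed yet.
        boolean recordSeen = dbRecords.contains(blockId);

        // Scanner, step 2: check the file; it is already gone.
        boolean fileMissing = !filesOnDisk.contains(blockId);

        // Deleter, step 2: the DB update lands only now.
        dbRecords.remove(blockId);

        return recordSeen && fileMissing;
    }

    public static void main(String[] args) {
        // Block ID taken from the log trace in this issue.
        System.out.println("scan reports missing chunk file: "
                + replayInterleaving(109611004723333878L));
    }
}
```

In this replay the scan sees a DB record with no file behind it, although the block was deleted on purpose and nothing is corrupt.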
>
> *The block deleting service holds a write lock during the process, but the
> data scan takes no read lock to avoid the conflict.*
> The problem still occurs even if the scan double-checks, when a block is
> first found missing on the FS, whether that block is still in the block-data
> table. The flush time of a DB batch operation is not predictable, so an
> immediate second lookup may not be a good solution: no fixed delay can
> guarantee that every batch has been flushed by the time it elapses.
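
One way to read the locking remark above is that the scan could take the read side of the same lock the deleter already holds for writing. The sketch below assumes that reading; it is not the change made in the pull request, and `LockedContainerScan`, its in-memory sets, and all method names are hypothetical. It uses `java.util.concurrent.locks.ReentrantReadWriteLock` so that the scan's "read DB, then check file" pair never overlaps the deleter's "delete file, then update DB" pair.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockedContainerScan {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Hypothetical stand-ins for the block-data table and the chunk directory.
    private final Set<Long> dbRecords = ConcurrentHashMap.newKeySet();
    private final Set<Long> filesOnDisk = ConcurrentHashMap.newKeySet();

    void putBlock(long id) {
        lock.writeLock().lock();
        try {
            filesOnDisk.add(id);
            dbRecords.add(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Same order as in the report: file first, then DB, under the write lock.
    void deleteBlock(long id) {
        lock.writeLock().lock();
        try {
            filesOnDisk.remove(id);
            dbRecords.remove(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // True only if a DB record exists while its file is missing.
    boolean scanFindsMissingFile(long id) {
        lock.readLock().lock();
        try {
            return dbRecords.contains(id) && !filesOnDisk.contains(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Runs a deleter and a scanner concurrently and counts false reports. */
    public static int countFalsePositives(int rounds) {
        LockedContainerScan c = new LockedContainerScan();
        for (long id = 0; id < rounds; id++) {
            c.putBlock(id);
        }
        int[] falsePositives = {0};
        Thread deleter = new Thread(() -> {
            for (long id = 0; id < rounds; id++) {
                c.deleteBlock(id);
            }
        });
        Thread scanner = new Thread(() -> {
            for (long id = 0; id < rounds; id++) {
                if (c.scanFindsMissingFile(id)) {
                    falsePositives[0]++;
                }
            }
        });
        deleter.start();
        scanner.start();
        try {
            deleter.join();
            scanner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return falsePositives[0];
    }

    public static void main(String[] args) {
        System.out.println("false positives: " + countFalsePositives(100_000));
    }
}
```

Because the scan's two reads happen under the read lock, it sees each block either fully present or fully deleted. The trade-off in a real datanode would be that a long-running data scan holding the read lock delays deletion, so the lock scope (per block rather than per container, for example) would need care.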
>
> *The log trace:*
> * 2022-09-30 16:07:38,535 [BlockDeletingService#5] INFO
> org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy: Deleted
> block file:
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block
>
> * 2022-09-30 16:07:39,244
> [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] ERROR
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck: Corruption
> detected in container: [6595] Exception: [Missing chunk file
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block]
>
> * 2022-09-30 16:07:39,545
> [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] WARN
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer: Moving
> container
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595
> to state UNHEALTHY from state:UNHEALTHY
> Trace:java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1060)
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.markContainerUnhealthy(KeyValueContainer.java:340)
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.markContainerUnhealthy(KeyValueHandler.java:1017)
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.markContainerUnhealthy(ContainerController.java:116)
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.scanContainer(ContainerDataScanner.java:72)
> org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.scanContainers(AbstractContainerScanner.java:99)
> org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.runIteration(AbstractContainerScanner.java:74)
>