[ 
https://issues.apache.org/jira/browse/HDDS-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-7300:
---------------------------------
    Labels: pull-request-available  (was: )

> Conflict between full data scan and block deletion
> --------------------------------------------------
>
>                 Key: HDDS-7300
>                 URL: https://issues.apache.org/jira/browse/HDDS-7300
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Xu Shao Hong
>            Assignee: Xu Shao Hong
>            Priority: Major
>              Labels: pull-request-available
>
> We enabled the full data scan and found that one container was marked as
> unhealthy because of a conflict between the full data scan and block
> deletion. The block deleting service first deletes the block file and then
> updates the DB, while the data scan first reads the DB and then checks that
> the block files exist.
> If the scanner reads a DB record and then finds that the block no longer
> exists on the FS, a `Missing chunk file` exception is thrown and the
> container is marked as unhealthy.
>  
> *The block deleting service holds a write lock during the process, but the
> data scan takes no read lock to avoid the conflict.*
> Double-checking whether the block is still in the block-data table after the
> block file is not found on the FS the first time does not solve the problem;
> it still happens. The flush time of a DB batch operation is not predictable,
> so an immediate second lookup may not be a good solution: no fixed delay can
> guarantee that every batch has been flushed by the time the delay has
> elapsed.
>  
> *The log trace:*
>  * 2022-09-30 16:07:38,535 [BlockDeletingService#5] INFO 
> org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy: Deleted 
> block file: 
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block
>  
>  * 2022-09-30 16:07:39,244 
> [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] ERROR 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck: Corruption 
> detected in container: [6595] Exception: [Missing chunk file 
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block]
>  
>  * 2022-09-30 16:07:39,545 
> [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] WARN 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer: Moving 
> container 
> /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595
>  to state UNHEALTHY from state:UNHEALTHY 
> Trace:java.lang.Thread.getStackTrace(Thread.java:1559) 
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1060) 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.markContainerUnhealthy(KeyValueContainer.java:340)
>  
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.markContainerUnhealthy(KeyValueHandler.java:1017)
>  
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.markContainerUnhealthy(ContainerController.java:116)
>  
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.scanContainer(ContainerDataScanner.java:72)
>  
> org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.scanContainers(AbstractContainerScanner.java:99)
>  
> org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.runIteration(AbstractContainerScanner.java:74)
>  
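
The interleaving described in the report can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Ozone code: the class `ScanDeleteSketch`, its two in-memory sets standing in for the block-data table and the chunk files, and the `betweenSteps` hook are all hypothetical names introduced here. The only real API used is `java.util.concurrent.locks.ReentrantReadWriteLock`. The sketch shows a scan without a lock observing a DB record whose file is already gone, and a scan that takes the read lock waiting until the deletion has updated both sides.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for one container; not an Ozone class.
class ScanDeleteSketch {
    // Assumption: these sets model the block-data table and the files on disk.
    private final Set<Long> dbRecords = new HashSet<>();
    private final Set<Long> blockFiles = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void addBlock(long id) {
        lock.writeLock().lock();
        try {
            blockFiles.add(id);
            dbRecords.add(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Same order as in the report: remove the file first, then the DB record.
    // betweenSteps is a test hook that runs inside the window between the two.
    void deleteBlock(long id, Runnable betweenSteps) {
        lock.writeLock().lock();
        try {
            blockFiles.remove(id);
            betweenSteps.run();
            dbRecords.remove(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Scan with no lock: read the DB, then check that each file exists.
    boolean scanWithoutLock() {
        for (long id : dbRecords) {
            if (!blockFiles.contains(id)) {
                return false; // corresponds to "Missing chunk file"
            }
        }
        return true;
    }

    // Scan under the read lock: blocks while a deletion holds the write lock.
    boolean scanWithReadLock() {
        lock.readLock().lock();
        try {
            return scanWithoutLock();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Calling `scanWithoutLock()` from inside the `betweenSteps` hook returns `false`, which mirrors the log trace above, while a `scanWithReadLock()` call started from another thread at the same moment only proceeds once `deleteBlock` has released the write lock. One trade-off to weigh is that a full data scan can be long-running, so holding a read lock for the whole scan could delay block deletion; a finer-grained lock scope may be preferable.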



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
