[
https://issues.apache.org/jira/browse/HDDS-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xu Shao Hong updated HDDS-7300:
-------------------------------
Description:
We enabled the full data scan and found that one container was marked
unhealthy because of a race between the full data scan and block deletion.
The block deleting service first deletes the block file and then updates the
DB, while the data scan first reads the DB and then checks that the blocks
exist on the FS.
If the scan reads a DB record and then finds that the block no longer exists on
the FS, a `Missing chunk file exception` is thrown and the container is marked
unhealthy.
*The block deleting service holds a write lock during the process, but the data
scan takes no read lock that would avoid the conflict.*
Re-checking whether the block is still in the block-data table after it is
first found missing on the FS does not solve the problem. The flush time of a
DB batch operation is not predictable, so an immediate second lookup may not be
a good solution: we cannot determine a fixed delay that guarantees every batch
has been flushed by the time it elapses.
was:
We have enabled the full data scan and found that one container is marked as
unhealthy due to the conflict between full data scan and block deletion.
The block deleting service first deletes the block and then updates the DB,
while the data scan first scans the DB and then checks the existence of the
blocks.
*The block deleting service has a write lock during the process but the data
scan has no read lock to avoid the conflict.*
Even by double checking the block if the block is still in the block-data table
when the block is not found on the FS for the first time, the problem still
happens. The flush time of DB batch operation is not predictable, so the direct
second retrieval may not be a good solution as we cannot determine a fixed
delay that could guarantee every batch could be flushed after this delay.
> Conflict between full data scan and block deletion
> --------------------------------------------------
>
> Key: HDDS-7300
> URL: https://issues.apache.org/jira/browse/HDDS-7300
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Xu Shao Hong
> Assignee: Xu Shao Hong
> Priority: Major
>
> We enabled the full data scan and found that one container was marked
> unhealthy because of a race between the full data scan and block deletion.
> The block deleting service first deletes the block file and then updates the
> DB, while the data scan first reads the DB and then checks that the blocks
> exist on the FS.
> If the scan reads a DB record and then finds that the block no longer exists
> on the FS, a `Missing chunk file exception` is thrown and the container is
> marked unhealthy.
>
> *The block deleting service holds a write lock during the process, but the
> data scan takes no read lock that would avoid the conflict.*
> Re-checking whether the block is still in the block-data table after it is
> first found missing on the FS does not solve the problem. The flush time of a
> DB batch operation is not predictable, so an immediate second lookup may not
> be a good solution: we cannot determine a fixed delay that guarantees every
> batch has been flushed by the time it elapses.
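
The interleaving described above can be reproduced with a small model. This is only a sketch under stated assumptions, not Ozone's actual code: `ContainerScanSketch`, `scanUnlocked`, `scanWithReadLock` and the two in-memory sets (standing in for the block-data table and the block files on the FS) are hypothetical names. It shows why an unlocked scan can report a false "missing" result, and how taking the read side of the lock that deletion already holds for writing would rule that out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of a single container; not Ozone's actual classes.
final class ContainerScanSketch {
    // Stand-in for the block-data table in the container DB.
    private final Set<Long> blockTable = ConcurrentHashMap.newKeySet();
    // Stand-in for the block files on the filesystem.
    private final Set<Long> blockFiles = ConcurrentHashMap.newKeySet();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void addBlock(long id) {
        lock.writeLock().lock();
        try {
            blockFiles.add(id);
            blockTable.add(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Same order as in the issue: delete the file first, then update the DB,
    // all under the write lock.
    void deleteBlock(long id) {
        lock.writeLock().lock();
        try {
            blockFiles.remove(id);
            blockTable.remove(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Scan without any lock. afterDbRead runs between the DB read and the
    // file check so that an interleaving can be simulated deterministically.
    // Returns true if the block looks healthy.
    boolean scanUnlocked(long id, Runnable afterDbRead) {
        boolean inDb = blockTable.contains(id);
        afterDbRead.run();
        return !inDb || blockFiles.contains(id);
    }

    // Same scan under the read lock: a deletion cannot run between the two
    // steps. afterDbRead must not call deleteBlock on the scanning thread,
    // because a read lock cannot be upgraded to a write lock.
    boolean scanWithReadLock(long id, Runnable afterDbRead) {
        lock.readLock().lock();
        try {
            return scanUnlocked(id, afterDbRead);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ContainerScanSketch c = new ContainerScanSketch();

        c.addBlock(1L);
        // Deletion completes between the DB read and the file check.
        System.out.println("unlocked scan healthy: "
            + c.scanUnlocked(1L, () -> c.deleteBlock(1L)));

        c.addBlock(2L);
        // Deletion starts mid-scan on another thread but has to wait for
        // the read lock to be released.
        Thread deleter = new Thread(() -> c.deleteBlock(2L));
        System.out.println("locked scan healthy: "
            + c.scanWithReadLock(2L, deleter::start));
        deleter.join();
    }
}
```

In this model the unlocked scan reports the block as unhealthy even though it was deleted legitimately, while the locked scan sees a consistent state. The trade-off is that holding the read lock for a whole scan delays deletions on that container, so how much of the scan the lock should cover is an open design question.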