[ https://issues.apache.org/jira/browse/HDDS-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Joshi reassigned HDDS-6449:
--------------------------------
Assignee: Neil Joshi
> Failed container delete can leave artifacts on disk
> ---------------------------------------------------
>
> Key: HDDS-6449
> URL: https://issues.apache.org/jira/browse/HDDS-6449
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.0.0, 1.1.0, 1.2.0
> Reporter: Ethan Rose
> Assignee: Neil Joshi
> Priority: Major
>
> When SCM issues a delete command to a datanode, the datanode does the
> following steps:
> writeLock()
> 1. The container is removed from the in-memory container set.
> writeUnlock()
> 2. The container metadata directory is recursively deleted.
> 3. The container chunks directory is recursively deleted.
> 4. The datanode sets the container's in-memory state to DELETED.
> - This is purely for the incremental container report (ICR), as the
> container is no longer present in the container set.
> 5. The datanode sends an ICR to SCM with the new state.
> - The container has been removed from the in-memory set at this
> point, so once the ICR is sent the container is unreachable.
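
For illustration, the sequence above can be modeled with a minimal Java
sketch. All names here (CurrentDeleteSketch, PathDeleter, lastReportedState)
are hypothetical and are not taken from the Ozone code base; PathDeleter
stands in for the recursive directory delete so that a failure can be
simulated.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the delete sequence; not Ozone's actual classes.
class CurrentDeleteSketch {
  enum State { CLOSED, DELETED }

  // Stand-in for "recursively delete this directory".
  interface PathDeleter {
    void delete(Path dir) throws IOException;
  }

  static final class Container {
    final long id;
    final Path metadataDir;
    final Path chunksDir;
    State state = State.CLOSED;

    Container(long id, Path metadataDir, Path chunksDir) {
      this.id = id;
      this.metadataDir = metadataDir;
      this.chunksDir = chunksDir;
    }
  }

  private final Map<Long, Container> containerSet = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  State lastReportedState; // stand-in for the ICR sent to SCM

  void add(Container c) {
    lock.writeLock().lock();
    try {
      containerSet.put(c.id, c);
    } finally {
      lock.writeLock().unlock();
    }
  }

  boolean contains(long id) {
    lock.readLock().lock();
    try {
      return containerSet.containsKey(id);
    } finally {
      lock.readLock().unlock();
    }
  }

  void delete(long id, PathDeleter deleter) throws IOException {
    Container c;
    lock.writeLock().lock();
    try {
      c = containerSet.remove(id);     // step 1: container is now unreachable
    } finally {
      lock.writeLock().unlock();
    }
    if (c == null) {
      return;
    }
    deleter.delete(c.metadataDir);     // step 2: an IOException here skips 3-5
    deleter.delete(c.chunksDir);       // step 3
    c.state = State.DELETED;           // step 4
    lastReportedState = c.state;       // step 5: report the new state
  }
}
```

If the deleter throws during step 2, the exception propagates out of
delete(), the chunks directory is left on disk, and the container can no
longer be found through contains(), so nothing is left to drive a retry.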
> In HDDS-6441, step 2 failed partway through, after the .container file
> and the (unused) db.checkpoints directory had already been removed from
> the metadata directory. The remaining steps were skipped once the IO
> exception was thrown during the delete. This caused an error to be logged
> when the partial state was read on datanode restart.
> The current method of deleting containers provides no way to recover from
> or retry a failed delete, because the container is removed from the
> in-memory set as the first step. This Jira aims to change the datanode
> delete steps so that if a delete fails, either the existing SCM container
> delete retry logic or the datanode itself can eventually remove the
> lingering state from disk.
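
One way to make a failed delete retryable, sketched here purely as an
assumption and not as the design this Jira settled on, is to keep the
container in the in-memory set until its directories are gone. The names
(RetryableDeleteSketch, PathDeleter, the DELETING state) are hypothetical,
and locking is omitted for brevity.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reordering of the delete steps; not Ozone's actual classes.
class RetryableDeleteSketch {
  enum State { CLOSED, DELETING, DELETED }

  // Stand-in for "recursively delete this directory". It should tolerate
  // a directory that is already gone, so that retries are idempotent.
  interface PathDeleter {
    void delete(Path dir) throws IOException;
  }

  static final class Container {
    final long id;
    final Path metadataDir;
    final Path chunksDir;
    volatile State state = State.CLOSED;

    Container(long id, Path metadataDir, Path chunksDir) {
      this.id = id;
      this.metadataDir = metadataDir;
      this.chunksDir = chunksDir;
    }
  }

  private final Map<Long, Container> containerSet = new ConcurrentHashMap<>();

  void add(Container c) {
    containerSet.put(c.id, c);
  }

  boolean contains(long id) {
    return containerSet.containsKey(id);
  }

  // Returns true once the on-disk state is gone and the container has been
  // dropped from the set; returns false if the delete must be retried.
  boolean delete(long id, PathDeleter deleter) {
    Container c = containerSet.get(id);
    if (c == null) {
      return true;                 // already deleted, so a retry is a no-op
    }
    c.state = State.DELETING;      // mark intent before touching the disk
    try {
      deleter.delete(c.metadataDir);
      deleter.delete(c.chunksDir);
    } catch (IOException e) {
      return false;                // container stays reachable for a retry
    }
    c.state = State.DELETED;
    containerSet.remove(id);       // removal from the set is now the last step
    return true;
  }
}
```

With this ordering a failed attempt leaves the container visible in the
set, so a repeated delete command can finish the job.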
>
> A shareable Google doc describes a potential solution: "to resolve the
> datanode artifact issue by using a background failedContainerDelete thread
> that is run on each datanode to cleanup failed container delete
> transactions.":
> https://docs.google.com/document/d/1ngRCbA_HxoNOof1kaiDuw0XYjJ2Z7t64ATF-V0TsJ-4/edit?usp=sharing
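
As a rough illustration of the background-thread idea quoted above, the
sketch below retries recorded failed deletes on a fixed schedule. It is an
assumption about how such a thread could look, not a summary of the linked
doc; FailedDeleteCleanerSketch, recordFailure, and runOnce are hypothetical
names, and only the thread name is borrowed from the quote.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical background cleaner; not Ozone's actual implementation.
class FailedDeleteCleanerSketch {
  private final Queue<Path> failedDeletes = new ConcurrentLinkedQueue<>();

  // Called by the delete path when a directory could not be removed.
  void recordFailure(Path containerDir) {
    failedDeletes.add(containerDir);
  }

  int pending() {
    return failedDeletes.size();
  }

  // One cleanup pass: retry each recorded directory once and re-queue it
  // if the delete still fails.
  void runOnce() {
    int n = failedDeletes.size();
    for (int i = 0; i < n; i++) {
      Path dir = failedDeletes.poll();
      if (dir == null) {
        break;
      }
      try {
        deleteRecursively(dir);
      } catch (IOException | RuntimeException e) {
        failedDeletes.add(dir);
      }
    }
  }

  // Runs runOnce() periodically on a single daemon thread.
  ScheduledExecutorService start(long periodSeconds) {
    ScheduledExecutorService ses =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "failedContainerDelete");
          t.setDaemon(true);
          return t;
        });
    ses.scheduleWithFixedDelay(
        this::runOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    return ses;
  }

  // Deletes children before parents; a missing directory is not an error.
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      paths = walk.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.deleteIfExists(p);
    }
  }
}
```

The queue here lives only in memory, so a real design would also need some
way to rediscover leftover directories after a datanode restart.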