[ 
https://issues.apache.org/jira/browse/HDDS-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Joshi reassigned HDDS-6449:
--------------------------------

    Assignee: Neil Joshi

> Failed container delete can leave artifacts on disk
> ---------------------------------------------------
>
>                 Key: HDDS-6449
>                 URL: https://issues.apache.org/jira/browse/HDDS-6449
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Ethan Rose
>            Assignee: Neil Joshi
>            Priority: Major
>
> When SCM issues a delete command to a datanode, the datanode does the 
> following steps:
> writeLock()
>     1. The container is removed from the in memory container set.
> writeUnlock()
>     2. The container metadata directory is recursively deleted.
>     3. The container chunks directory is recursively deleted.
>     4. The datanode sets the container's in memory state to DELETED
>         - This is purely for the ICR as the container is not present in the 
> container set anymore.
>     5. Datanode sends incremental container report to SCM with the new state.
>         - The container has been removed from the in-memory set at this 
> point, so once the ICR is sent the container is unreachable.
> In HDDS-6441, a failure in step 2 removed the .container file and the 
> (unused) db.checkpoints directory from the metadata directory, and the 
> remaining steps were skipped once the IO exception was thrown during the 
> delete. This caused an error to be logged when the partial state was read on 
> datanode restart.
> The current method of deleting containers provides no way to recover from or 
> retry a failed delete, because the container is removed from the in-memory 
> set as the first step. This Jira aims to change the datanode delete steps so 
> that if a delete fails, the existing SCM container delete retry logic or the 
> datanode itself can eventually get the lingering state off the disk.
>  
> A shareable Google doc describes a potential solution: "to resolve the 
> datanode artifact issue by using a background failedContainerDelete thread 
> that is run on each datanode to cleanup failed container delete 
> transactions.":
> https://docs.google.com/document/d/1ngRCbA_HxoNOof1kaiDuw0XYjJ2Z7t64ATF-V0TsJ-4/edit?usp=sharing
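
The delete ordering described in the issue can be illustrated with a small, self-contained sketch. This is not Ozone code: the class ContainerDeleteSketch, the map standing in for the container set, the DirDeleter hook, and the file names other than the ".container" suffix are all hypothetical, chosen only to show why removing the container from the in-memory set first leaves a failed delete unreachable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Stream;

class ContainerDeleteSketch {
    // Hypothetical stand-in for the datanode's in-memory container set.
    static final Map<Long, Path> containerSet = new ConcurrentHashMap<>();
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical hook so a disk failure can be simulated in step 2.
    interface DirDeleter {
        void delete(Path dir) throws IOException;
    }

    // Deletes children before parents by walking in reverse path order.
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Mirrors the ordering in the issue: the container leaves the set first.
    static String deleteContainer(long id, DirDeleter deleter) throws IOException {
        Path root;
        lock.writeLock().lock();
        try {
            root = containerSet.remove(id);           // step 1
        } finally {
            lock.writeLock().unlock();
        }
        if (root == null) {
            return "NOT_FOUND";
        }
        deleter.delete(root.resolve("metadata"));     // step 2
        deleter.delete(root.resolve("chunks"));       // step 3
        return "DELETED";                             // steps 4-5: state for the ICR
    }

    // Returns {stillInSet, chunksDirStillOnDisk} after a simulated step-2 failure.
    static boolean[] runFailedDelete() {
        try {
            Path root = Files.createTempDirectory("container-1");
            Files.createDirectories(root.resolve("metadata"));
            Files.createFile(root.resolve("metadata").resolve("1.container"));
            Files.createDirectories(root.resolve("chunks"));
            Files.createFile(root.resolve("chunks").resolve("1.chunk"));
            containerSet.put(1L, root);

            // Simulated partial failure: one file is removed, then the IO error hits.
            DirDeleter failing = dir -> {
                Files.deleteIfExists(dir.resolve("1.container"));
                throw new IOException("simulated disk error");
            };
            try {
                deleteContainer(1L, failing);
            } catch (IOException expected) {
                // Steps 3-5 never ran; nothing records that a retry is needed.
            }
            boolean stillInSet = containerSet.containsKey(1L);
            boolean chunksOnDisk = Files.exists(root.resolve("chunks"));
            deleteRecursively(root);                  // clean up the temp directory
            return new boolean[] {stillInSet, chunksOnDisk};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = runFailedDelete();
        System.out.println("still in container set: " + r[0]);
        System.out.println("chunks left on disk:    " + r[1]);
    }
}
```

In this sketch the failed delete leaves the chunks directory on disk while the container is already gone from the in-memory set, so a repeated delete command for the same ID finds nothing to act on. That is the gap the issue proposes to close, either through the existing SCM retry logic or through cleanup on the datanode itself.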



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

