[ 
https://issues.apache.org/jira/browse/HDDS-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541764#comment-17541764
 ] 

Neil Joshi commented on HDDS-6449:
----------------------------------

In the keyValueHandler.handleDeleteContainer datanode delete path, four items 
are deleted, including the DB, metadata and chunks. During the delete, an 
IOException can be raised while deleting the metadata, the chunks or the 
container directory. 

If an exception is raised, it appears to produce the condition this Jira was 
opened to address: unaccounted artifacts left on disk. One way to handle this 
is to catch the exception in the delete path and enqueue the failed container 
delete transaction, to be processed by a dedicated thread service that removes 
artifacts left on datanodes by failed container deletions.
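
To make the idea concrete, here is a rough sketch of what such a service 
could look like. It is only an illustration: FailedDeleteCleanupService, 
ArtifactCleaner and maxAttempts are hypothetical names that do not exist in 
Ozone, and the real delete path would need to decide what "removing 
artifacts" means for each container layout.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch; none of these names come from the Ozone code base. */
class FailedDeleteCleanupService implements Runnable {

  /** Hypothetical callback that removes whatever a container left on disk. */
  interface ArtifactCleaner {
    void removeArtifacts(long containerId) throws IOException;
  }

  /** A failed delete plus the number of cleanup attempts made so far. */
  private static final class Entry {
    final long containerId;
    final int attempts;

    Entry(long containerId, int attempts) {
      this.containerId = containerId;
      this.attempts = attempts;
    }
  }

  private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
  private final ArtifactCleaner cleaner;
  private final int maxAttempts;

  FailedDeleteCleanupService(ArtifactCleaner cleaner, int maxAttempts) {
    this.cleaner = cleaner;
    this.maxAttempts = maxAttempts;
  }

  /** Called from the delete path's catch block when an IOException occurs. */
  void enqueue(long containerId) {
    queue.add(new Entry(containerId, 0));
  }

  int pending() {
    return queue.size();
  }

  /** Tries one queued entry without blocking; returns true if it was cleaned. */
  boolean processOne() {
    Entry e = queue.poll();
    return e != null && attempt(e);
  }

  /** Re-queues the entry on failure until maxAttempts is reached. */
  private boolean attempt(Entry e) {
    try {
      cleaner.removeArtifacts(e.containerId);
      return true;
    } catch (IOException ex) {
      int attempts = e.attempts + 1;
      if (attempts < maxAttempts) {
        queue.add(new Entry(e.containerId, attempts));
      }
      return false;
    }
  }

  /** Body of the dedicated thread: block for work until interrupted. */
  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        attempt(queue.take());
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
```

The sketch retries immediately and drops an entry after maxAttempts 
failures; a real service would probably want a delay between attempts and 
some logging or metric when it gives up.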

> Failed container delete can leave artifacts on disk
> ---------------------------------------------------
>
>                 Key: HDDS-6449
>                 URL: https://issues.apache.org/jira/browse/HDDS-6449
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Ethan Rose
>            Priority: Major
>
> When SCM issues a delete command to a datanode, the datanode does the 
> following steps:
> writeLock()
>     1. The container is removed from the in memory container set.
> writeUnlock()
>     2. The container metadata directory is recursively deleted.
>     3. The container chunks directory is recursively deleted.
>     4. The datanode sets the container's in memory state to DELETED
>         - This is purely for the ICR as the container is not present in the 
> container set anymore.
>     5. Datanode sends incremental container report to SCM with the new state.
>         - The container has been removed from the in-memory set at this 
> point, so once the ICR is sent the container is unreachable.
> In HDDS-6441, a failure in step 2 removed the .container file and 
> db.checkpoints directory (unused) from the metadata directory, and the rest 
> of the steps were not done after the IO exception was thrown during the 
> delete. This caused an error to be logged when the partial state was read on 
> datanode restart.
> This current method of deleting containers provides no way to recover from or 
> retry a failed delete, because the container is removed from the in-memory 
> set as the first step. This Jira aims to change the datanode delete steps so 
> that if a delete fails, the existing SCM container delete retry logic or the 
> datanode itself can eventually get the lingering state off the disk.
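
The ordering problem described in the issue can be modelled in a few lines. 
This is a simplified sketch, not the datanode code: CurrentDeleteOrderSketch 
and DirDeleter are hypothetical names, directories are plain strings, a 
synchronized block stands in for writeLock()/writeUnlock(), and steps 4 and 5 
are left out.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the delete order listed in the issue description. */
class CurrentDeleteOrderSketch {

  /** Hypothetical stand-in for a recursive directory delete. */
  interface DirDeleter {
    void deleteRecursively(String dir) throws IOException;
  }

  /** Stands in for the datanode's in-memory container set. */
  final Set<Long> containerSet = new HashSet<>();
  private final DirDeleter deleter;

  CurrentDeleteOrderSketch(DirDeleter deleter) {
    this.deleter = deleter;
  }

  void delete(long containerId, String metadataDir, String chunksDir)
      throws IOException {
    synchronized (containerSet) {            // writeLock() ... writeUnlock()
      containerSet.remove(containerId);      // step 1: forget the container
    }
    deleter.deleteRecursively(metadataDir);  // step 2: may throw IOException
    deleter.deleteRecursively(chunksDir);    // step 3: skipped if step 2 threw
    // Steps 4 and 5 (mark DELETED, send ICR) are omitted from this sketch.
  }
}
```

If the deleter throws while handling the metadata directory, the container 
is already gone from containerSet and the chunks directory is never touched, 
which is the unreachable partial state the description refers to.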



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

