[
https://issues.apache.org/jira/browse/HDDS-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729853#comment-17729853
]
Ethan Rose commented on HDDS-8770:
----------------------------------
Some possible solutions I came up with:
h2. Leave the orphaned entries in RocksDB
With this solution, we would always do the container directory operation first
as an atomic move, followed by the DB update. If there is a failure between the
two steps, we would just leave the entries in the DB.
h3. Pros
* tmp directory used for deletes (and in the future for imports) can be cleared
out without a DB update.
* The disk, not the DB, is what startup and the scanner use to index
containers, so the extra RocksDB entries will not cause incorrect behavior.
h3. Cons
* No method to clean up the extra entries in the DB. Since this is a rare
failure, there may not be many of them, but they will occupy that space for the
lifetime of the disk.
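A minimal sketch of this flow, using an in-memory map as a stand-in for the per-volume RocksDB and plain NIO for the atomic move. All names here (the class, the key prefix scheme, the tmp directory layout) are illustrative assumptions, not the actual Ozone datanode API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Option 1 sketch: atomic directory move first, DB update second.
// A crash between the two steps leaves orphaned DB entries behind,
// which are tolerated rather than cleaned up later.
public class DeleteLeavesOrphans {

  // Stand-in for the volume's RocksDB; keys are prefixed by container ID.
  static final Map<String, String> DB = new HashMap<>();

  static void deleteContainer(Path containerDir, Path tmpDeletedDir,
      String containerPrefix, boolean crashAfterMove) throws IOException {
    // Step 1: atomic rename into the tmp deleted-containers directory.
    Path target = tmpDeletedDir.resolve(containerDir.getFileName());
    Files.move(containerDir, target, StandardCopyOption.ATOMIC_MOVE);
    if (crashAfterMove) {
      return; // simulated failure: DB entries stay behind, harmlessly
    }
    // Step 2: clear the container's DB entries.
    DB.keySet().removeIf(k -> k.startsWith(containerPrefix));
    // Step 3: remove the tmp copy; no DB access is needed from here on.
    Files.delete(target);
  }

  public static void main(String[] args) throws IOException {
    Path volume = Files.createTempDirectory("vol");
    Path tmp = Files.createDirectories(volume.resolve("tmp-deleted"));
    Path c1 = Files.createDirectory(volume.resolve("container-1"));
    DB.put("1|block-7", "meta");

    deleteContainer(c1, tmp, "1|", true); // crash after the move
    System.out.println("dir gone: " + !Files.exists(c1));
    System.out.println("orphan left: " + DB.containsKey("1|block-7"));
    // tmp cleanup touches only the filesystem, never the DB.
    Files.delete(tmp.resolve("container-1"));
  }
}
```

Because step 1 is an atomic rename on the same filesystem, the container is either fully visible or fully in the tmp directory; the orphaned DB entries are the only possible inconsistency, and they are never read.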
h2. Move the container to the DELETED state before starting the delete process
As the first step in the delete process, we could move the container to the
DELETED state in the container file. Then we could delete the DB entries and
move the container to the deleted containers directory. If the DB update
succeeds but the move fails, the datanode will see that the container is in the
DELETED state on startup. It will know that the RocksDB entries have already
been cleared and the directory can be deleted.
h3. Pros
* tmp directory used for deletes (and in the future for imports) can be cleared
out without a DB update.
* DB entries are cleared out in failure scenarios.
h3. Cons
* This may make it impossible to delete some unhealthy containers. If the
container file is corrupted, the update may fail. However, this update is
required to proceed with the delete process.
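The state-marker flow can be sketched the same way. The `container.state` file name and layout are assumptions standing in for the real `.container` file format; the point is only the ordering and the startup check:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Option 2 sketch: persist the DELETED state first, then clear the DB,
// then move the directory. A crash after the DB update is recoverable:
// startup sees DELETED and knows no further DB work is needed.
public class DeleteWithStateMarker {

  static final Map<String, String> DB = new HashMap<>();

  static void deleteContainer(Path containerDir, Path tmpDeletedDir,
      String containerPrefix, boolean crashBeforeMove) throws IOException {
    // Step 1: persist the DELETED state in the container file.
    Files.writeString(containerDir.resolve("container.state"), "DELETED");
    // Step 2: clear the container's DB entries.
    DB.keySet().removeIf(k -> k.startsWith(containerPrefix));
    if (crashBeforeMove) {
      return; // simulated failure: directory is still on the volume
    }
    // Step 3: atomic move to the deleted-containers dir, then delete.
    Path target = tmpDeletedDir.resolve(containerDir.getFileName());
    Files.move(containerDir, target, StandardCopyOption.ATOMIC_MOVE);
    Files.delete(target.resolve("container.state"));
    Files.delete(target);
  }

  // Startup recovery: a DELETED container already had its DB entries
  // cleared, so its directory can be dropped without a DB update.
  static boolean dbAlreadyCleared(Path containerDir) throws IOException {
    Path state = containerDir.resolve("container.state");
    return Files.exists(state) && "DELETED".equals(Files.readString(state));
  }

  public static void main(String[] args) throws IOException {
    Path volume = Files.createTempDirectory("vol");
    Path tmp = Files.createDirectories(volume.resolve("tmp-deleted"));
    Path c1 = Files.createDirectory(volume.resolve("container-1"));
    DB.put("1|block-7", "meta");

    deleteContainer(c1, tmp, "1|", true); // crash after the DB update
    System.out.println("recoverable: " + dbAlreadyCleared(c1));
    System.out.println("db cleared: " + !DB.containsKey("1|block-7"));
  }
}
```

The con above shows up in step 1: if the container file cannot be written (e.g. it is corrupted), the whole delete is blocked, since the marker is the recovery signal.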
h2. Check if container is active before removing DB entries when clearing the
deleted containers directory on startup/shutdown
This would keep the current delete flow, but when the deleted containers
directory is cleared, we would check the in-memory container set for an
active version of the container. If one exists, skip clearing the DB.
h3. Pros
* Simple update to the existing flow.
h3. Cons
* Clearing the deleted containers directory still requires a DB update.
* Clearing the deleted containers directory also requires access to the
in-memory container set.
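A sketch of the guarded cleanup, reproducing the re-import race from the issue description. The in-memory container set is modeled as a plain set of IDs; all names are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Option 3 sketch: keep the current delete steps, but before clearing a
// leftover entry in the tmp deleted-containers directory, consult the
// in-memory container set. If an active replica with the same ID exists
// (the re-import race), the DB entries belong to it and must be skipped.
public class GuardedTmpCleanup {

  static final Map<Long, String> DB = new HashMap<>();
  static final Set<Long> ACTIVE = new HashSet<>(); // in-memory container set

  static void clearDeletedContainer(long containerId, Path leftoverDir)
      throws IOException {
    // Only clear DB entries when no active replica owns them.
    if (!ACTIVE.contains(containerId)) {
      DB.remove(containerId);
    }
    // The leftover directory itself is always safe to remove.
    Files.delete(leftoverDir);
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("tmp-deleted");
    Path leftover = Files.createDirectory(tmp.resolve("container-1"));

    // Container 1 was deleted (move succeeded, DB update failed), then
    // re-imported: its DB entries now belong to the active replica.
    DB.put(1L, "imported-meta");
    ACTIVE.add(1L);

    clearDeletedContainer(1L, leftover); // startup/shutdown cleanup
    System.out.println("db preserved: " + DB.containsKey(1L));
  }
}
```

This is the smallest change of the three, at the cost of keeping the DB update and the container-set dependency inside the cleanup path.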
> Cleanup of failed container delete may remove datanode RocksDB entries of
> active container
> ------------------------------------------------------------------------------------------
>
> Key: HDDS-8770
> URL: https://issues.apache.org/jira/browse/HDDS-8770
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Ethan Rose
> Priority: Major
>
> Now that container schema v3 has been implemented, container level updates
> like delete and import require both moving the container directory, and
> editing the container's entries in RocksDB.
> Originally in commit bf5b6f5 the container delete steps were:
> 1. Remove entries from RocksDB
> 2. Delete container directory
> In this implementation, it is possible that the RocksDB update succeeds but
> the container delete fails, leaving behind a container directory on the disk
> that is discovered at startup. The datanode would load the container and
> recalculate only the metadata values
> (KeyValueContainerUtil#verifyAndFixupContainerData). Delete transaction and
> block data would be lost, leaving this container corrupted, but reported as
> healthy to SCM until the scanner identifies it.
> After HDDS-6449, the steps were changed so that failed directory deletes
> would not leave broken container directories that the datanode discovers on
> startup. The deletion steps became:
> 1. Move container directory to tmp deleted containers directory on the same
> file system (atomic).
> 2. Delete DB entries
> 3. Delete container from tmp directory.
> The deleted container directory will be cleared on datanode startup and
> shutdown, and this process will also clear corresponding RocksDB entries that
> may not have been cleared if an error happened after step 1. This can cause
> RocksDB data for an active container replica to be deleted incorrectly in the
> following case:
> 1. Container 1 is deleted. Rename of the container directory to the
> delete directory succeeds but DB update fails.
> 2. Container 1 is re-imported to the same datanode on the same volume.
> The imported SST files overwrite the old ones in the DB.
> 3. Datanode is restarted, triggering cleanup of the deleted container
> directory and RocksDB entries for any containers there.
> - This deletes data belonging to container ID 1, which now happens to
> belong to the active container.
> Container import can have similar issues as well. We need a standardized
> process to keep DB and directory updates consistent and recover from failures
> between the two operations.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)