[
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-7655:
----------------------------
Fix Version/s: 0.15.0
> Support configuration for clean to fail execution if at least one file
> is marked as a failed delete
> ------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Krishen Bhan
> Assignee: sivabalan narayanan
> Priority: Minor
> Labels: clean, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed
> as deleted (or non-existing) will be marked as a "failed delete". Although
> these failed deletes will be added to `.clean` metadata, if incremental clean
> is used then these files might never be picked up again by a future clean
> plan, unless a "full-scan" clean ends up being scheduled. In addition to
> leaving files unnecessarily taking up storage space for longer, this can
> lead to the following dataset consistency issue for COW datasets:
> # Insert at C1 creates file group f1 in a partition
> # Replacecommit at RC2 creates file group f2 in the partition, and replaces
> f1
> # Any reader of the partition that calls the HUDI API (with or without using
> MDT) will recognize that f1 should be ignored, as it has been replaced. This
> is because the RC2 instant file is in the active timeline
> # Some completed instants later, an incremental clean is scheduled. It moves
> the "earliest commit to retain" to a time after instant time RC2, so it
> targets f1 for deletion. But during execution of the plan, it fails to delete
> f1.
> # An archive job is eventually triggered, and archives C1 and RC2. Note that
> f1 is still in the partition
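The visibility issue in the steps above can be sketched with a toy model (illustrative only; the reader functions, file-group names, and timeline representation below are hypothetical, not Hudi APIs):

```python
# Toy model of the COW visibility issue (not Hudi code). A timeline-aware
# reader hides replaced file groups only while the replacing instant is in
# the active timeline; a listing-based reader trusts whatever exists on DFS.

def timeline_reader(files_on_storage, active_timeline, replaced_by):
    """Hide file groups whose replacing instant is still in the timeline."""
    hidden = {fg for fg, rc in replaced_by.items() if rc in active_timeline}
    return sorted(files_on_storage - hidden)

def listing_reader(files_on_storage):
    """Naive reader: every file found on storage looks like a valid group."""
    return sorted(files_on_storage)

storage = {"f1", "f2"}       # clean failed to delete f1
replaced = {"f1": "RC2"}     # RC2 replaced f1 with f2

# Before archival: RC2 is in the active timeline, so f1 is correctly ignored.
print(timeline_reader(storage, {"C1", "RC2"}, replaced))  # ['f2']

# After C1 and RC2 are archived: stale f1 becomes visible again.
print(timeline_reader(storage, set(), replaced))          # ['f1', 'f2']
print(listing_reader(storage))                            # ['f1', 'f2']
```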
> At this point, any job/query that reads the aforementioned partition directly
> through DFS file system calls (without using the MDT FILES partition)
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in
> the active timeline. This is a data consistency issue, and will only be
> resolved if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to
> fail execution of a clean plan unless all targeted files are confirmed as
> deleted (or as already absent from DFS), "blocking" the clean. The next clean
> attempt will then re-execute the existing plan, since clean plans cannot be
> "rolled back".
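The proposed behavior could look roughly like the following sketch (function names and the `fail_on_failed_delete` flag are hypothetical stand-ins, not actual Hudi classes or config keys):

```python
# Sketch of clean execution that fails, instead of silently recording failed
# deletes, when the blocking option is enabled. Names are illustrative only.

def execute_clean_plan(plan_files, delete_fn, fail_on_failed_delete=False):
    failed = []
    for path in plan_files:
        if not delete_fn(path):   # delete_fn returns True only when the file
            failed.append(path)   # is confirmed deleted or already absent
    if failed and fail_on_failed_delete:
        # Leave the plan pending; the next clean attempt re-executes the same
        # plan, since clean plans cannot be rolled back.
        raise RuntimeError(f"Clean blocked, failed deletes: {failed}")
    return failed                 # legacy behavior: record them and move on

undeletable = {"f1"}              # simulate a file that cannot be deleted
delete_ok = lambda p: p not in undeletable

print(execute_clean_plan(["f1", "f2"], delete_ok))  # ['f1']
try:
    execute_clean_plan(["f1", "f2"], delete_ok, fail_on_failed_delete=True)
except RuntimeError as e:
    print(e)
```

With the flag off, failed deletes are only reported (today's behavior); with it on, the clean "blocks" until every targeted file is gone.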
--
This message was sent by Atlassian Jira
(v8.20.10#820010)