[ https://issues.apache.org/jira/browse/HUDI-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-6339:
----------------------------
Component/s: table-service
> Ability to Disable Partition Deletes during Clean
> -------------------------------------------------
>
> Key: HUDI-6339
> URL: https://issues.apache.org/jira/browse/HUDI-6339
> Project: Apache Hudi
> Issue Type: Improvement
> Components: cleaning, table-service
> Reporter: Dave Hagman
> Assignee: Dave Hagman
> Priority: Critical
> Labels: data-integrity
>
> We recently experienced a large data loss in one of our largest Hudi tables.
> We observed that entire partitions in our table were being deleted but we
> were initially unsure why. After a deep analysis of the code, we traced it to
> the Cleaning service, specifically the logic which decides whether a given
> partition is empty. We are running Hudi 0.12.3 so this is the link to the
> code I'm referencing:
> [https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]
>
> The immediate cause of our issue is that we use the Metadata Table (MDT),
> which had become inconsistent with the underlying filesystem (we have not
> yet determined how). We did not have any auditing for the MDT to alert us
> to inconsistencies, so the MDT remained in this state for a considerable
> amount of time.
> Because of the inconsistencies, there were many partitions that existed on
> disk but did not exist in the MDT. A full, non-incremental clean was run on
> the table, which caused the Cleaner to scan all partitions in the table and
> compare what was on disk with what was in the MDT. The Cleaner mistakenly
> considered all of the partitions that existed on disk but not in the MDT to
> be empty (even though they were not) and proceeded to recursively delete
> all of those partitions.
> Due to the high-risk nature of partition deletes, I propose a configuration
> which allows Hudi operators to disable partition deletes on critical tables
> where deleting entire partitions is never desired. This is the case for all
> of our time-series Hudi tables.
>
> NOTE: I see that there are some improvements to the logic which determines an
> empty partition in the master branch (not yet released). These improvements
> are great but due to the high-risk nature of these partition deletes, I still
> propose that an additional configuration be added so that users can fully
> disable partition deletes against tables that should never experience them.
> Recent changes:
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L392
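
To make the proposal above concrete, here is a minimal sketch of the kind of guard being requested. It is not Hudi code: the class, the method names, and the config key `hoodie.clean.allow.partition.deletes` are all hypothetical and chosen only for illustration. It models partition listings as plain sets of strings, reports partitions that exist on disk but are missing from the MDT listing, and returns an empty delete list whenever partition deletes are disabled.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; none of these names exist in Apache Hudi.
class PartitionDeleteGuard {

    // Hypothetical config key for the proposed switch (an assumption, not a real option).
    static final String ALLOW_PARTITION_DELETES_KEY = "hoodie.clean.allow.partition.deletes";

    /** Partitions that exist on disk but are absent from the metadata listing. */
    static Set<String> missingFromMetadata(Set<String> onDisk, Set<String> inMetadata) {
        Set<String> missing = new TreeSet<>(onDisk);
        missing.removeAll(inMetadata);
        return missing;
    }

    /**
     * Partitions the cleaner would be allowed to delete.
     * When partition deletes are disabled, nothing is returned regardless of
     * which partitions were judged to be empty.
     */
    static Set<String> partitionsToDelete(Set<String> judgedEmpty, boolean allowPartitionDeletes) {
        if (!allowPartitionDeletes) {
            return new TreeSet<>();
        }
        return new TreeSet<>(judgedEmpty);
    }

    public static void main(String[] args) {
        Set<String> onDisk = Set.of("2023/06/01", "2023/06/02");
        Set<String> inMetadata = Set.of("2023/06/01");

        // An audit like this would have flagged the inconsistency early.
        Set<String> suspect = missingFromMetadata(onDisk, inMetadata);
        System.out.println("On disk but not in MDT: " + suspect);

        // With the switch off, the suspect partitions are never scheduled for deletion.
        System.out.println("Deletes when disabled: " + partitionsToDelete(suspect, false));
        System.out.println("Deletes when enabled:  " + partitionsToDelete(suspect, true));
    }
}
```

The point of the design is that the switch is checked independently of the emptiness check, so a wrong answer from the metadata listing cannot turn into a recursive delete on tables where the operator has disabled it.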