Dave Hagman created HUDI-6339:
---------------------------------
Summary: Ability to Disable Partition Deletes during Clean
Key: HUDI-6339
URL: https://issues.apache.org/jira/browse/HUDI-6339
Project: Apache Hudi
Issue Type: Improvement
Components: cleaning
Reporter: Dave Hagman
Assignee: Dave Hagman
We recently experienced a large data loss in one of our largest Hudi tables. We
observed that entire partitions in our table were being deleted but we were
initially unsure why. After a deep analysis of the code, we traced it to the
Cleaning service, specifically the logic which decides whether a given
partition is empty. We are running Hudi 0.12.3 so this is the link to the code
I'm referencing:
[https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]
The root cause of our issue is that the Metadata Table (MDT), which we have
enabled, became inconsistent with the underlying filesystem (we do not yet know
what caused the inconsistency). We had no auditing in place to alert us to MDT
inconsistencies, so the MDT remained in this state for a considerable amount of
time.
Because of the inconsistencies, there were many partitions that existed on disk
but did not exist in the MDT. A full, non-incremental clean was run on the
table which caused the Cleaner to scan all partitions in the table and compare
what was on disk with what was in the MDT. The cleaner mistakenly considered
all of the partitions that were on disk to be empty (even though they were not)
and proceeded to perform a recursive delete of all those partitions.
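To make the failure mode concrete, here is a minimal sketch of the kind of
audit that would have flagged the affected partitions. It is a simplified
model, not Hudi code: the class name, the method name, and the representation
of both views as plain maps (partition path to file names) are assumptions made
for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified model, NOT Hudi code. Each map goes from partition path
// to the data file names found in that partition.
final class MetadataAudit {

    /**
     * Returns the partitions that hold at least one file on disk but have
     * no files listed in the metadata view. A cleaner that trusts only the
     * metadata view would consider each of these partitions empty.
     */
    static Set<String> falselyEmptyPartitions(
            Map<String, List<String>> filesOnDisk,
            Map<String, List<String>> filesInMetadata) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : filesOnDisk.entrySet()) {
            boolean hasFilesOnDisk = !entry.getValue().isEmpty();
            // A partition missing from the metadata view counts as "no files listed".
            List<String> listed =
                    filesInMetadata.getOrDefault(entry.getKey(), List.of());
            if (hasFilesOnDisk && listed.isEmpty()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical partition paths and file names.
        Map<String, List<String>> disk = Map.of(
                "2023/05/01", List.of("a.parquet"),
                "2023/05/02", List.of("b.parquet"));
        Map<String, List<String>> metadata = Map.of(
                "2023/05/01", List.of("a.parquet"));
        System.out.println(falselyEmptyPartitions(disk, metadata));
    }
}
```

Any partition returned by this check is one that exists on disk with data but
would look empty through the metadata view, which is the condition that led to
the recursive deletes described above.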
Because partition deletes are high-risk, I propose a configuration that lets
Hudi operators disable them on critical tables where deleting an entire
partition is never desired. All of our time-series Hudi tables fall into this
category.
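As a rough sketch of what such a switch could look like, the guard below vetoes
a partition delete unless the operator has left deletes enabled. Everything
here is hypothetical: the config key, the class, and the method are invented
for illustration and do not exist in Hudi. The default is assumed to be "true"
so that existing behavior would be unchanged unless a table opts out.

```java
import java.util.Properties;

// Hypothetical sketch, NOT an existing Hudi class or config.
final class PartitionDeleteGuard {

    // Invented key for illustration only; a real key name would be
    // chosen as part of the actual change.
    static final String ALLOW_PARTITION_DELETE_KEY =
            "hypothetical.clean.allow.partition.delete";

    /**
     * Returns true only when the planner considers the partition empty
     * AND the operator has not disabled partition deletes for the table.
     */
    static boolean shouldDeletePartition(Properties tableProps,
                                         boolean plannerConsidersEmpty) {
        boolean deletesAllowed = Boolean.parseBoolean(
                tableProps.getProperty(ALLOW_PARTITION_DELETE_KEY, "true"));
        return deletesAllowed && plannerConsidersEmpty;
    }
}
```

With the key set to "false" on a critical table, the cleaner in this sketch
would never delete a partition, even if the emptiness check were wrong.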
NOTE: I see that the master branch (not yet released) contains some
improvements to the logic that determines whether a partition is empty. These
improvements are great, but because partition deletes are so high-risk, I still
propose that an additional configuration be added so that users can fully
disable partition deletes on tables that should never have partitions removed.