[ 
https://issues.apache.org/jira/browse/HUDI-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Hagman updated HUDI-6339:
------------------------------
    Description: 
We recently experienced a large data loss in one of our largest Hudi tables. We 
observed that entire partitions in our table were being deleted but we were 
initially unsure why. After a deep analysis of the code, we traced it to the 
Cleaning service, specifically the logic which decides whether a given 
partition is empty. We are running Hudi 0.12.3 so this is the link to the code 
I'm referencing:

[https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]

 

The immediate cause of our issue is that we are using the Metadata Table (MDT) 
and it became inconsistent with the underlying filesystem (we have not yet 
determined how). We did not have any auditing for the MDT to alert us to 
inconsistencies, so the MDT remained in this state for a considerable amount of 
time.

Because of the inconsistencies, there were many partitions that existed on disk 
but did not exist in the MDT. A full, non-incremental clean was run on the 
table which caused the Cleaner to scan all partitions in the table and compare 
what was on disk with what was in the MDT. The cleaner mistakenly considered 
all of the partitions that were on disk to be empty (even though they were not) 
and proceeded to perform a recursive delete of all those partitions.
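
To illustrate the kind of MDT audit that was missing, here is a minimal sketch 
in Java. It is not Hudi code: the class MdtPartitionAudit and the method 
missingFromMetadata are hypothetical names, and the sketch assumes the caller 
has already obtained both partition listings (one from a direct filesystem 
listing, one from the MDT) as sets of relative partition paths.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical audit helper for illustration only; not part of Hudi.
final class MdtPartitionAudit {
    private MdtPartitionAudit() {}

    /**
     * Returns the partitions that exist on disk but are absent from the
     * metadata table listing, in sorted order. Inputs are not modified.
     */
    static Set<String> missingFromMetadata(Set<String> onDisk, Set<String> inMetadata) {
        Set<String> missing = new TreeSet<>(onDisk);
        missing.removeAll(inMetadata);
        return missing;
    }

    public static void main(String[] args) {
        // Example partition paths are made up.
        Set<String> onDisk = Set.of("2023/06/01", "2023/06/02", "2023/06/03");
        Set<String> inMetadata = Set.of("2023/06/01");
        Set<String> missing = missingFromMetadata(onDisk, inMetadata);
        if (!missing.isEmpty()) {
            System.out.println("MDT is missing partitions found on disk: " + missing);
        }
    }
}
```

A non-empty result ahead of a full clean would be a signal to stop and 
investigate rather than let the clean proceed.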

Due to the high-risk nature of partition deletes, I propose a configuration 
which allows Hudi operators to disable partition deletes on critical tables 
where deleting entire partitions is never desired. This aligns with all of our 
time-series Hudi tables.
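
A rough sketch of how such a switch might behave is shown below. Everything in 
it is an assumption made for illustration: the class PartitionDeleteGuard, the 
method partitionsToDelete, and the config key are hypothetical and do not exist 
in Hudi. The idea is only that the clean planner would drop every 
partition-level delete from its plan when the operator has disabled them, 
while file-level cleaning continues as usual.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical guard for illustration only; not part of Hudi.
final class PartitionDeleteGuard {
    // Invented config key; not an existing Hudi option.
    static final String PARTITION_DELETE_ENABLED_KEY = "example.cleaner.partition.delete.enabled";

    private PartitionDeleteGuard() {}

    /**
     * Returns the partitions the cleaner may delete. When the (hypothetical)
     * flag is set to false, no partition is ever returned, regardless of
     * what the emptiness check concluded. The flag defaults to true so that
     * behavior is unchanged unless an operator opts out.
     */
    static List<String> partitionsToDelete(List<String> emptyCandidates, Map<String, String> props) {
        boolean enabled = Boolean.parseBoolean(
            props.getOrDefault(PARTITION_DELETE_ENABLED_KEY, "true"));
        return enabled ? new ArrayList<>(emptyCandidates) : new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("2023/06/02", "2023/06/03");
        Map<String, String> props = Map.of(PARTITION_DELETE_ENABLED_KEY, "false");
        System.out.println("Partitions to delete: " + partitionsToDelete(candidates, props));
    }
}
```

Defaulting the flag to true keeps existing tables on today's behavior; only 
tables that explicitly opt out would be protected from partition deletes.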

 

NOTE: I see that there are some improvements to the logic which determines an 
empty partition in the master branch (not yet released). These improvements are 
great, but due to the high-risk nature of these partition deletes, I still 
propose that an additional configuration be added so that users can fully 
disable partition deletes against tables that should never experience them.

Recent changes: 
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L392

  was:
We recently experienced a large data loss in one of our largest Hudi tables. We 
observed that entire partitions in our table were being deleted but we were 
initially unsure why. After a deep analysis of the code, we traced it to the 
Cleaning service, specifically the logic which decides whether a given 
partition is empty. We are running Hudi 0.12.3 so this is the link to the code 
I'm referencing:

[https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]

 

The immediate cause of our issue is that we are using the Metadata Table (MDT) 
and it became inconsistent with the underlying filesystem (we have not yet 
determined how). We did not have any auditing for the MDT to alert us to 
inconsistencies, so the MDT remained in this state for a considerable amount of 
time.

Because of the inconsistencies, there were many partitions that existed on disk 
but did not exist in the MDT. A full, non-incremental clean was run on the 
table which caused the Cleaner to scan all partitions in the table and compare 
what was on disk with what was in the MDT. The cleaner mistakenly considered 
all of the partitions that were on disk to be empty (even though they were not) 
and proceeded to perform a recursive delete of all those partitions.

Due to the high-risk nature of partition deletes, I propose a configuration 
which allows Hudi operators to disable partition deletes on critical tables 
where deleting entire partitions is never desired. This aligns with all of our 
time-series Hudi tables.

 

NOTE: I see that there are some improvements to the logic which determines an 
empty partition in the master branch (not yet released). These improvements are 
great, but due to the high-risk nature of these partition deletes, I still 
propose that an additional configuration be added so that users can fully 
disable partition deletes against tables that should never experience them.

 


> Ability to Disable Partition Deletes during Clean
> -------------------------------------------------
>
>                 Key: HUDI-6339
>                 URL: https://issues.apache.org/jira/browse/HUDI-6339
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning
>            Reporter: Dave Hagman
>            Assignee: Dave Hagman
>            Priority: Critical
>              Labels: data-integrity
>
> We recently experienced a large data loss in one of our largest Hudi tables. 
> We observed that entire partitions in our table were being deleted but we 
> were initially unsure why. After a deep analysis of the code, we traced it to 
> the Cleaning service, specifically the logic which decides whether a given 
> partition is empty. We are running Hudi 0.12.3 so this is the link to the 
> code I'm referencing:
> [https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]
>  
> The immediate cause of our issue is that we are using the Metadata Table 
> (MDT) and it became inconsistent with the underlying filesystem (we have not 
> yet determined how). We did not have any auditing for the MDT to alert us to 
> inconsistencies, so the MDT remained in this state for a considerable amount 
> of time.
> Because of the inconsistencies, there were many partitions that existed on 
> disk but did not exist in the MDT. A full, non-incremental clean was run on 
> the table which caused the Cleaner to scan all partitions in the table and 
> compare what was on disk with what was in the MDT. The cleaner mistakenly 
> considered all of the partitions that were on disk to be empty (even though 
> they were not) and proceeded to perform a recursive delete of all those 
> partitions.
> Due to the high-risk nature of partition deletes, I propose a configuration 
> which allows Hudi operators to disable partition deletes on critical tables 
> where deleting entire partitions is never desired. This aligns with all of 
> our time-series Hudi tables.
>  
> NOTE: I see that there are some improvements to the logic which determines an 
> empty partition in the master branch (not yet released). These improvements 
> are great, but due to the high-risk nature of these partition deletes, I 
> still propose that an additional configuration be added so that users can 
> fully disable partition deletes against tables that should never experience 
> them.
> Recent changes: 
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L392



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
