[ 
https://issues.apache.org/jira/browse/HUDI-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6339:
----------------------------
    Component/s: table-service

> Ability to Disable Partition Deletes during Clean
> -------------------------------------------------
>
>                 Key: HUDI-6339
>                 URL: https://issues.apache.org/jira/browse/HUDI-6339
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning, table-service
>            Reporter: Dave Hagman
>            Assignee: Dave Hagman
>            Priority: Critical
>              Labels: data-integrity
>
> We recently experienced a large data loss in one of our largest Hudi tables. 
> We observed that entire partitions in our table were being deleted but we 
> were initially unsure why. After a deep analysis of the code, we traced it to 
> the Cleaning service, specifically the logic which decides whether a given 
> partition is empty. We are running Hudi 0.12.3, so this is the link to the 
> code I'm referencing:
> [https://github.com/apache/hudi/blob/release-0.12.3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L370]
>  
> The root cause of our issue is that we are using the Metadata Table (MDT) and 
> it became inconsistent with the underlying filesystem (we do not yet know 
> how). We did not have any auditing for the MDT to alert us to 
> inconsistencies, so the MDT remained in this state for a considerable amount 
> of time.
> Because of the inconsistencies, there were many partitions that existed on 
> disk but did not exist in the MDT. A full, non-incremental clean was run on 
> the table which caused the Cleaner to scan all partitions in the table and 
> compare what was on disk with what was in the MDT. The Cleaner mistakenly 
> considered all of the partitions that were on disk but missing from the MDT 
> to be empty (even though they were not) and proceeded to perform a recursive 
> delete of all those partitions.
> Due to the high-risk nature of partition deletes, I propose a configuration 
> which allows Hudi operators to disable partition deletes on critical tables 
> where deleting entire partitions is never desired. This aligns with all of 
> our time-series Hudi tables.
>  
> NOTE: I see that there are some improvements to the logic which determines an 
> empty partition in the master branch (not yet released). These improvements 
> are great, but due to the high-risk nature of these partition deletes, I still 
> propose that an additional configuration be added so that users can fully 
> disable partition deletes against tables that should never experience those.
> Recent changes: 
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L392
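
The proposal above could be sketched roughly as follows. This is an illustrative sketch only, not Hudi's actual CleanPlanner logic: the class name PartitionDeleteGuard, the allowPartitionDeletes flag, and the method partitionsToDelete are all hypothetical, and the flag does not correspond to any existing Hudi configuration key. The sketch assumes a simplified cleaner that treats a partition as empty when the metadata view lists no files for it, which is the failure mode described in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hudi. It models a cleaner that decides
// which partitions to delete, with an operator-controlled switch to
// disable partition deletes entirely.
class PartitionDeleteGuard {

    // Hypothetical switch; NOT an existing Hudi configuration key.
    private final boolean allowPartitionDeletes;

    PartitionDeleteGuard(boolean allowPartitionDeletes) {
        this.allowPartitionDeletes = allowPartitionDeletes;
    }

    /**
     * Returns the partitions a simplified cleaner would delete.
     *
     * @param partitionsOnDisk partition paths found on the filesystem
     * @param filesPerPartitionInMetadata files per partition as reported by
     *        the metadata view (may be stale or incomplete)
     */
    List<String> partitionsToDelete(List<String> partitionsOnDisk,
                                    Map<String, List<String>> filesPerPartitionInMetadata) {
        List<String> toDelete = new ArrayList<>();
        if (!allowPartitionDeletes) {
            // Guard: never schedule a partition delete on this table.
            return toDelete;
        }
        for (String partition : partitionsOnDisk) {
            List<String> files = filesPerPartitionInMetadata.get(partition);
            // A partition missing from the metadata view looks "empty" here,
            // even if it still holds data on disk.
            if (files == null || files.isEmpty()) {
                toDelete.add(partition);
            }
        }
        return toDelete;
    }
}
```

With the flag set to true, a partition that exists on disk but is absent from the metadata view is returned as a delete candidate. With the flag set to false, the method returns an empty list regardless of what the metadata view reports, so an inconsistent MDT cannot lead to a partition delete.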



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
