[ 
https://issues.apache.org/jira/browse/HUDI-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2792:
---------------------------------
    Labels: pull-request-available  (was: )

> Metadata table enters into inconsistent state
> ---------------------------------------------
>
>                 Key: HUDI-2792
>                 URL: https://issues.apache.org/jira/browse/HUDI-2792
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> We have validations to ensure the metadata table is in a valid state.
> Specifically, if a file that was never added is deleted from the metadata
> table, we throw an exception.
> {code:java}
> if (hoodieRecord.isPresent()) {
>   if (!hoodieRecord.get().getData().getDeletions().isEmpty()) {
>     throw new HoodieMetadataException("Metadata partition list record is inconsistent: "
>         + hoodieRecord.get().getData());
>   }
>   partitions = hoodieRecord.get().getData().getFilenames();
>   // Partition-less tables have a single empty partition
>   if (partitions.contains(NON_PARTITIONED_NAME)) {
>     partitions.remove(NON_PARTITIONED_NAME);
>     partitions.add("");
>   }
> } {code}
> That is, after merging all log records for a particular record (partition),
> we fail if the record still has valid files (i.e. it is not empty) and also
> has a delete list.
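>
> As a rough illustration of the check described above, here is a minimal
> sketch of merging log records for one partition and then validating the
> result. PartitionRecordSketch and its merge rules are assumptions made for
> illustration only; this is not Hudi's actual payload implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for one partition's record in the metadata table.
// NOT Hudi's payload class; names and merge rules are assumptions.
final class PartitionRecordSketch {
  // filename -> true for a delete marker, false for a live file
  private final Map<String, Boolean> entries = new TreeMap<>();

  PartitionRecordSketch add(String file) {
    entries.put(file, false);
    return this;
  }

  PartitionRecordSketch delete(String file) {
    entries.put(file, true);
    return this;
  }

  // Merges a newer log record on top of this one and returns the result.
  PartitionRecordSketch merge(PartitionRecordSketch newer) {
    PartitionRecordSketch merged = new PartitionRecordSketch();
    merged.entries.putAll(entries);
    for (Map.Entry<String, Boolean> e : newer.entries.entrySet()) {
      String file = e.getKey();
      if (!e.getValue()) {
        merged.entries.put(file, false);   // a newer add wins
      } else if (Boolean.FALSE.equals(merged.entries.get(file))) {
        merged.entries.remove(file);       // delete cancels an earlier add
      } else {
        merged.entries.put(file, true);    // delete with no prior add survives
      }
    }
    return merged;
  }

  List<String> getFilenames() {
    return select(false);
  }

  List<String> getDeletions() {
    return select(true);
  }

  private List<String> select(boolean deleted) {
    List<String> out = new ArrayList<>();
    entries.forEach((file, isDeleted) -> {
      if (isDeleted == deleted) {
        out.add(file);
      }
    });
    return out;
  }

  // Mirrors the strict check: any surviving delete marker is an error.
  void validate() {
    if (!getDeletions().isEmpty()) {
      throw new IllegalStateException(
          "Metadata partition list record is inconsistent: " + entries);
    }
  }

  public static void main(String[] args) {
    PartitionRecordSketch commit = new PartitionRecordSketch().add("f1").add("f2");
    PartitionRecordSketch cleanup = new PartitionRecordSketch().delete("f1").delete("f3");
    PartitionRecordSketch merged = commit.merge(cleanup);
    try {
      merged.validate();
    } catch (IllegalStateException ex) {
      // f3 was deleted without ever being added, so the strict check fails
      System.out.println(ex.getMessage());
    }
  }
}
```

> In this sketch, a delete that matches an earlier add simply removes the
> file, while a delete with no matching add is left behind as a marker, which
> is what trips the validation.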
>  
> I was able to reproduce this issue in one of my test scenarios. Even though
> the actual test case is a bit tangential, here is a convincing case that
> requires relaxing this constraint.
>  
> Due to Spark task failures, there could be more files in the system than
> are tracked in the commit metadata. So, if a user tries to roll back a
> completed write (which had some Spark task failures), the rollback will
> include more files than the initial set of files added as part of the
> commit metadata.
> So we need to relax this constraint (if a file that was never added is
> deleted from the metadata table, we throw an exception). If not, I cannot
> think of a way to get around this.
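>
> One possible shape for a relaxed check is sketched below: deletes for files
> that were never tracked are set aside instead of failing the whole record.
> RelaxedDeleteSketch, applyDeletes and the Result fields are hypothetical
> names used only for this illustration, not existing Hudi APIs.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a relaxed delete check; not part of Hudi.
final class RelaxedDeleteSketch {
  static final class Result {
    // files still tracked after the deletes were applied
    final Set<String> remaining = new TreeSet<>();
    // deletes for files that were never tracked (e.g. left by failed tasks)
    final Set<String> untrackedDeletes = new TreeSet<>();
  }

  static Result applyDeletes(Collection<String> tracked, Collection<String> deletes) {
    Set<String> known = new TreeSet<>(tracked);
    Result result = new Result();
    result.remaining.addAll(known);
    for (String file : deletes) {
      if (known.contains(file)) {
        result.remaining.remove(file);
      } else {
        // Tolerated instead of throwing: the file was never in commit metadata.
        result.untrackedDeletes.add(file);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // The commit metadata tracked f1 and f2; a failed task attempt left f3 behind.
    Result r = applyDeletes(Arrays.asList("f1", "f2"), Arrays.asList("f1", "f2", "f3"));
    System.out.println("remaining=" + r.remaining + " untracked=" + r.untrackedDeletes);
  }
}
```

> Keeping the untracked deletes in a separate set means they can still be
> logged or counted, so some minimal consistency signal is retained.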
>  
> Also, I want to draw attention to
> [https://github.com/apache/hudi/pull/3678]. This patch addresses the case
> where a clean or rollback failed midway and was then re-attempted: we want
> to capture all files that got deleted overall (1st attempt and 2nd attempt)
> while applying the changes to the metadata table. We need to think through
> how this pans out w.r.t. the constraint we have in the metadata payload. I
> couldn't think of any gaps here, but wanted to flag it in case someone can
> think of anything.
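>
> The "capture all deleted files across attempts" idea can be pictured as a
> plain set union over the per-attempt delete lists. This is only a sketch
> under that assumption; RetriedDeleteSketch and allDeleted are made-up names
> and do not reflect how the linked patch is implemented.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: union of files deleted across retried attempts.
final class RetriedDeleteSketch {
  static Set<String> allDeleted(List<? extends Collection<String>> attempts) {
    Set<String> all = new TreeSet<>();
    for (Collection<String> attempt : attempts) {
      all.addAll(attempt); // a file deleted in any attempt counts once
    }
    return all;
  }

  public static void main(String[] args) {
    // First attempt failed midway after deleting f1; the retry deleted f2 and f3.
    Set<String> deleted = allDeleted(Arrays.asList(
        Arrays.asList("f1"),
        Arrays.asList("f2", "f3")));
    System.out.println(deleted);
  }
}
```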
>  
> Trying to get ideas on how to go about this. Can we add some minimal
> constraint, but loosen the existing one so that we support the Spark task
> failure cases?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
