[ https://issues.apache.org/jira/browse/HUDI-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-2792:
--------------------------------------
    Description: 
We have validations to ensure the metadata table is in a valid state. 
Specifically, if a file that was never added is being deleted from the 
metadata table, we throw an exception. 
{code:java}
if (hoodieRecord.isPresent()) {
  if (!hoodieRecord.get().getData().getDeletions().isEmpty()) {
    throw new HoodieMetadataException("Metadata partition list record is inconsistent: "
        + hoodieRecord.get().getData());
  }

  partitions = hoodieRecord.get().getData().getFilenames();
  // Partition-less tables have a single empty partition
  if (partitions.contains(NON_PARTITIONED_NAME)) {
    partitions.remove(NON_PARTITIONED_NAME);
    partitions.add("");
  }
} {code}
In other words, after merging all log records, if a particular record 
(partition) has valid files (i.e. the record is not empty) and also carries a 
delete list, we fail. 

 

I was able to reproduce this issue in one of my test scenarios. Even though 
the actual test case is a bit tangential, here is a convincing case that 
requires relaxing this constraint. 

 

Due to Spark task failures, there can be more files in the system than are 
tracked in the commit metadata. So, if a user tries to roll back a completed 
write (which had some Spark task failures), the rollback will delete more 
files than the initial set of files recorded in the commit metadata.
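Concretely, the mismatch is just a set difference between what is on storage and what the commit metadata tracks. A minimal sketch (file names are purely hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

// Files on storage minus files tracked in the commit metadata leaves the
// spurious copies written by failed or retried Spark tasks.
public class UntrackedFiles {
  public static Set<String> untracked(Set<String> onStorage, Set<String> inCommitMetadata) {
    Set<String> extra = new HashSet<>(onStorage);
    extra.removeAll(inCommitMetadata);
    return extra;
  }
}
```

A rollback that lists storage will see both sets, so its deletions are a superset of what the metadata table ever saw added.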

So we need to relax this constraint (throwing an exception when a file that 
was never added is deleted from the metadata table). Otherwise, I cannot 
think of a way to get around this. 
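One possible relaxation (a sketch of the idea only, not a proposed patch) is to silently drop deletions that have no matching addition during the merge, instead of failing the whole record:

```java
import java.util.HashSet;
import java.util.Set;

// Relaxed merge sketch: a deletion with no matching addition corresponds to
// a spurious file from a failed task and is simply ignored, rather than
// marking the record inconsistent.
public class RelaxedMerge {
  public static Set<String> merge(Set<String> additions, Set<String> deletions) {
    Set<String> result = new HashSet<>(additions);
    result.removeAll(deletions); // removeAll ignores elements that are absent
    return result;
  }
}
```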

 

Also, I would like to draw attention to 
[https://github.com/apache/hudi/pull/3678]. That patch addresses the case 
where a clean or rollback fails mid-way and is then re-attempted: we want to 
capture all files that got deleted overall (1st attempt and 2nd attempt) while 
applying the changes to the metadata table. We need to think through how this 
pans out w.r.t. the constraint we have in the metadata payload. I couldn't 
find any gaps here, but wanted to flag it in case someone can think of anything. 
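Under that patch's scenario, the metadata update would carry the union of files deleted across both attempts, and re-applying the same deletions should be a no-op. A simplified sketch of why the relaxed rule stays safe there (again, plain collections and hypothetical names, not the actual patch):

```java
import java.util.HashSet;
import java.util.Set;

// Re-attempted clean/rollback sketch: capture the union of deletions across
// both attempts; applying the same deletions again must leave the tracked
// set unchanged (idempotence).
public class RetriedDeletions {
  public static Set<String> applyDeletions(Set<String> tracked,
                                           Set<String> firstAttempt,
                                           Set<String> secondAttempt) {
    Set<String> allDeleted = new HashSet<>(firstAttempt);
    allDeleted.addAll(secondAttempt);
    Set<String> result = new HashSet<>(tracked);
    result.removeAll(allDeleted);
    return result;
  }
}
```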

 

I am looking for ideas on how to go about this. Can we add some minimal 
constraint while loosening the existing one, so that we support the 
Spark-task-failure cases? 

> Metadata table enters into inconsistent state
> ---------------------------------------------
>
>                 Key: HUDI-2792
>                 URL: https://issues.apache.org/jira/browse/HUDI-2792
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.10.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
