[
https://issues.apache.org/jira/browse/HUDI-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-2792:
--------------------------------------
Description:
I see we have validations to ensure the metadata table is in a valid state.
Specifically, if a file is being deleted from the metadata table that was never
added, we throw an exception.
{code:java}
if (hoodieRecord.isPresent()) {
  if (!hoodieRecord.get().getData().getDeletions().isEmpty()) {
    throw new HoodieMetadataException("Metadata partition list record is inconsistent: "
        + hoodieRecord.get().getData());
  }
  partitions = hoodieRecord.get().getData().getFilenames();
  // Partition-less tables have a single empty partition
  if (partitions.contains(NON_PARTITIONED_NAME)) {
    partitions.remove(NON_PARTITIONED_NAME);
    partitions.add("");
  }
} {code}
i.e. after merging all log records, if a particular record (partition) has
valid files (i.e. the record is not empty) and also has a delete list, we fail.
I was able to reproduce this issue in one of my test scenarios. Even though
the actual test case is a bit tangential, here is a convincing case that
requires relaxing this constraint.
Due to Spark task failures, there could be more files on storage than are
tracked in the commit metadata. So, if a user tries to roll back a completed
write (one that had some Spark task failures), the rollback will touch more
files than the initial set of files recorded in the commit metadata.
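The mismatch above can be sketched with a minimal example (all file names here are hypothetical, purely for illustration): the commit metadata records only the files written by successful task attempts, while a failed attempt may have left an extra file on storage that a rollback would then try to delete.

```java
import java.util.HashSet;
import java.util.Set;

public class RollbackExtraFiles {
  // Files the commit metadata recorded (only the successful task attempt).
  static final Set<String> COMMITTED_FILES = Set.of("f1_attempt1.parquet");

  // Files actually on storage: a failed Spark task attempt left one behind.
  static final Set<String> FILES_ON_STORAGE =
      Set.of("f1_attempt0.parquet", "f1_attempt1.parquet");

  // Files the rollback would delete even though the metadata table never saw
  // them being added -- exactly the case the current validation rejects.
  static Set<String> untrackedDeletions() {
    Set<String> extra = new HashSet<>(FILES_ON_STORAGE);
    extra.removeAll(COMMITTED_FILES);
    return extra;
  }

  public static void main(String[] args) {
    // The attempt-0 leftover is deleted without a matching "add" record.
    System.out.println(untrackedDeletions());
  }
}
```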
So we need to relax this constraint (if a file is deleted from the metadata
table that was never added, we throw an exception). If not, I cannot think of
a way to get around this.
Trying to get ideas on how to go about this. Can we add some minimal
constraint, but loosen the existing one so that we support the Spark task
failure cases?
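One possible direction, sketched below with hypothetical names (this is not the actual Hudi API): instead of throwing whenever a non-empty delete list coexists with valid files, only honor deletions that target files the metadata table actually tracks, and silently skip (or log) deletions of never-added files such as leftovers from failed Spark task attempts.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RelaxedValidation {
  // Hypothetical helper: rather than failing when 'deletions' is non-empty,
  // keep only deletions of files that were actually added to the metadata
  // table; never-added files (e.g. failed-attempt leftovers) are ignored.
  static List<String> filterUntracked(Set<String> addedFiles, List<String> deletions) {
    return deletions.stream()
        .filter(addedFiles::contains)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Set<String> added = Set.of("a.parquet", "b.parquet");
    List<String> deletions = List.of("b.parquet", "stray_attempt0.parquet");
    // prints [b.parquet] -- the stray, never-added file is dropped
    System.out.println(filterUntracked(added, deletions));
  }
}
```

A minimal constraint could still be kept on top of this, e.g. failing only when a tracked file is deleted more than once.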
was:
I see we have validations to ensure the metadata table is in a valid state.
Specifically, if a file is deleted from the metadata table that was never
added, we throw an exception.
I was able to reproduce this issue in one of my test scenarios. Even though
the actual test case is a bit tangential, here is a convincing case that
requires relaxing this constraint.
Due to Spark task failures, there could be more files on storage than are
tracked in the commit metadata. So, if a user tries to roll back a completed
write (one that had some Spark task failures), the rollback will touch more
files than the initial set of files recorded in the commit metadata.
So we need to relax this constraint (if a file is deleted from the metadata
table that was never added, we throw an exception). If not, I cannot think of
a way to get around this.
Trying to get ideas on how to go about this. Can we add some minimal
constraint, but loosen the existing one so that we support the Spark task
failure cases?
> Metadata table enters into inconsistent state
> ---------------------------------------------
>
> Key: HUDI-2792
> URL: https://issues.apache.org/jira/browse/HUDI-2792
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Fix For: 0.10.0
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)