majian1998 opened a new pull request, #9472: URL: https://github.com/apache/hudi/pull/9472
### Change Logs

Implemented a solution to prevent duplicate-key errors in concurrent replace commit operations:

- Register the replaced file IDs on the timeline for every replace commit, whether or not it has completed.
- Save the pending replace commit's `partitionToReplaceFileIds` into the inflight commit's extra metadata.
- Update drop partition and clustering to read the `partitionToReplaceFileIds` information from the timeline and verify that no file ID is replaced twice.
- Remove the modification that the related issue introduced for compaction commits.

### Impact

No public API or user-facing feature changes.

### Risk level (write none, low medium or high below)

low

### Documentation Update

Related issue: https://issues.apache.org/jira/browse/HUDI-5553

The specific problem: when replace commits run concurrently, two of them may point at the same file ID, resulting in a duplicate-key error. The fix for the existing issue prevents scheduling a delete partition while there are pending clustering or compaction operations. That solution is incomplete, however, because the validation is one-way: if a clustering plan is scheduled before the delete partition is committed, both replace commits can still contain duplicate keys, and the table becomes inconsistent once both plans are committed. This failure mode is fatal, other similar interleavings can bypass the existing validation in the same way, and the existing check operates at the partition level, which is not precise enough.

Here is my solution. As shown in the figure, both drop partition and clustering pass through a window during which their replaced file IDs are not yet registered on the timeline, which is exactly the scenario the previous issue did not cover. This PR registers the file IDs involved in each replace commit on the active timeline; completed replace commits already persist `partitionToReplaceFileIds`, so only pending ones need handling. Because Spark SQL's delete partition creates the requested commit in advance of the write, which makes the requested state inconvenient to work with, the pending replace commit's `partitionToReplaceFileIds` is saved into the inflight commit's extra metadata instead (see the first sketch after the checklist below). Each drop partition or clustering operation then only needs to confirm that its own inflight metadata has reached the timeline, read the `partitionToReplaceFileIds` information from the timeline, and verify that none of its target file IDs is already being replaced (see the second sketch below).

In short: every replace commit registers its replaced file IDs on the timeline whether or not it has completed, and every submission checks that information, so any second replace commit touching the same file ID is rejected and duplicate keys cannot occur. Once the same idea is applied to compaction commits, the modification introduced by the related issue can be removed.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
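To make the mechanism concrete, here is a minimal sketch of the registration step: stashing a pending replace commit's `partitionToReplaceFileIds` into the inflight commit's extra metadata. The extra-metadata key `REPLACE_FILE_IDS_KEY`, the class name, and the JSON encoding are illustrative assumptions, not necessarily what this PR uses.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hudi.common.model.HoodieCommitMetadata;

public class ReplaceFileIdRegistration {
  // Hypothetical extra-metadata key; the actual key in the PR may differ.
  public static final String REPLACE_FILE_IDS_KEY = "_hoodie.pending.replace.file.ids";

  /**
   * Builds inflight commit metadata for a pending replace commit so that the
   * file IDs it intends to replace become visible on the timeline before the
   * commit completes.
   */
  public static HoodieCommitMetadata buildInflightMetadata(
      Map<String, List<String>> partitionToReplaceFileIds) throws IOException {
    HoodieCommitMetadata metadata = new HoodieCommitMetadata();
    // Serialize the partition -> replaced-file-IDs map as JSON (encoding assumed).
    String encoded = new ObjectMapper().writeValueAsString(partitionToReplaceFileIds);
    metadata.addMetadata(REPLACE_FILE_IDS_KEY, encoded);
    return metadata;
  }
}
```

The caller would then transition the requested replace instant to inflight with this metadata, so concurrent writers can observe the claimed file IDs as soon as the inflight file lands on the timeline.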
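And a hedged sketch of the validation side, assuming the Hudi 0.x timeline APIs and the same illustrative key and helper names as above: collect every file ID already claimed by a completed or inflight replace commit, then reject a new plan on any overlap.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.exception.HoodieException;

public class ReplaceFileIdValidator {

  /** Collects every file ID claimed by a completed or inflight replace commit. */
  public static Set<String> collectReplacedFileIds(HoodieTimeline activeTimeline) {
    Set<String> replaced = new HashSet<>();
    ObjectMapper mapper = new ObjectMapper();

    // Completed replace commits already persist partitionToReplaceFileIds.
    activeTimeline.getCompletedReplaceTimeline().getInstants().forEach(instant -> {
      try {
        HoodieReplaceCommitMetadata metadata = HoodieReplaceCommitMetadata.fromBytes(
            activeTimeline.getInstantDetails(instant).get(), HoodieReplaceCommitMetadata.class);
        metadata.getPartitionToReplaceFileIds().values().forEach(replaced::addAll);
      } catch (IOException e) {
        throw new HoodieException("Failed to read replace metadata for " + instant, e);
      }
    });

    // Pending replace commits expose their target file IDs through the inflight
    // commit's extra metadata (key and JSON encoding assumed, as above).
    activeTimeline.filterPendingReplaceTimeline().filterInflights().getInstants().forEach(instant -> {
      try {
        HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
            activeTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
        String encoded = metadata.getExtraMetadata()
            .get(ReplaceFileIdRegistration.REPLACE_FILE_IDS_KEY);
        if (encoded != null) {
          Map<String, List<String>> pending = mapper.readValue(
              encoded, new TypeReference<Map<String, List<String>>>() {});
          pending.values().forEach(replaced::addAll);
        }
      } catch (IOException e) {
        throw new HoodieException("Failed to read inflight metadata for " + instant, e);
      }
    });
    return replaced;
  }

  /** Rejects a new replace plan if any of its file IDs is already being replaced. */
  public static void validateNoDuplicateFileIds(
      HoodieTimeline activeTimeline, Set<String> plannedFileIds) {
    Set<String> conflicts = collectReplacedFileIds(activeTimeline);
    conflicts.retainAll(plannedFileIds);
    if (!conflicts.isEmpty()) {
      throw new HoodieException("Replace commit conflicts on file IDs: " + conflicts);
    }
  }
}
```

Because the check runs after the operation's own inflight metadata is on the timeline, either of two conflicting replace commits is guaranteed to see the other's claimed file IDs, regardless of scheduling order; this is what makes the validation two-way.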
