[
https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Y Ethan Guo updated HUDI-6719:
------------------------------
Fix Version/s: 1.0.2
> Fix data inconsistency issues caused by concurrent clustering and delete
> partition.
> -----------------------------------------------------------------------------------
>
> Key: HUDI-6719
> URL: https://issues.apache.org/jira/browse/HUDI-6719
> Project: Apache Hudi
> Issue Type: Bug
> Components: clustering, table-service
> Reporter: Ma Jian
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.2
>
>
> Related issue: https://issues.apache.org/jira/browse/HUDI-5553
> The specific problem is that when replace commit operations are executed
> concurrently, two replace commits may point to the same file ID, resulting
> in a duplicate key error. The earlier issue prevented scheduling a delete
> partition while there are pending clustering or compaction operations.
> However, that solution is incomplete: because the validation is one-way,
> data can still become inconsistent if a clustering plan is scheduled before
> the delete partition is committed. In that case both replace commits still
> contain duplicate keys, and the table becomes inconsistent once both plans
> are committed. This is very serious, and there are other similar scenarios
> that can bypass the validation from the existing issue. Moreover, the
> existing check operates at the partition level and is not precise enough.
> Here is my solution:
> !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX!
> As shown in the figure, both drop partition and clustering go through a
> window of time during which they are not yet registered on the timeline,
> which is the scenario the previous issue did not cover. Here, I register the
> replace file IDs involved in each replace commit on the active timeline
> (completed replace commits already store partitionToReplaceFileIds, so only
> pending requests need to be handled). In the Spark SQL case, delete
> partition creates a requested commit in advance during the write, which is
> inconvenient to handle, so I save the pending replace commit's
> partitionToReplaceFileIds to the inflight commit's extra metadata. Then,
> each time a drop partition or clustering is executed, it only needs to read
> the partitionToReplaceFileIds information from the timeline, after ensuring
> the inflight commit information has been persisted, to verify that there are
> no duplicate file IDs and prevent this kind of error from occurring.
> In short, every replace commit registers its replace file IDs on the
> timeline whether or not it has completed, and every commit checks this
> information to ensure no file ID is replaced twice. Any replace commit
> containing an already-claimed file ID is rejected, guaranteeing that there
> are no duplicate keys.
> Once this idea is also applied to compaction commits, the modification
> introduced by the related issue can be removed.
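The conflict check described above can be sketched as follows. This is a minimal simplification modeled with plain collections, not Hudi's actual API: the class and method names (ReplaceConflictCheck, hasConflict) and the instant strings are illustrative, and the per-instant file-ID sets stand in for the flattened partitionToReplaceFileIds recorded in commit metadata.

```java
import java.util.*;

// Hypothetical simplification of the proposed validation: before a replace
// commit (clustering, delete partition) proceeds, collect the file IDs
// already claimed by other replace commits on the timeline (completed or
// pending) and reject any overlap.
public class ReplaceConflictCheck {

    // claimedByInstant: file IDs each timeline instant intends to replace,
    // as read from completed commit metadata or inflight extra metadata.
    static boolean hasConflict(Map<String, Set<String>> claimedByInstant,
                               String newInstant,
                               Set<String> newReplaceFileIds) {
        for (Map.Entry<String, Set<String>> e : claimedByInstant.entrySet()) {
            if (e.getKey().equals(newInstant)) {
                continue; // an instant does not conflict with itself
            }
            for (String fileId : newReplaceFileIds) {
                if (e.getValue().contains(fileId)) {
                    // the same file ID would be replaced twice -> duplicate key
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> timeline = new HashMap<>();
        // a pending clustering plan already claims files f1 and f2
        timeline.put("001.clustering", new HashSet<>(Arrays.asList("f1", "f2")));

        // a concurrent delete partition tries to replace f2 as well: rejected
        System.out.println(hasConflict(timeline, "002.deletepartition",
                new HashSet<>(Arrays.asList("f2", "f3")))); // true

        // disjoint file IDs: safe to proceed
        System.out.println(hasConflict(timeline, "003.deletepartition",
                new HashSet<>(Arrays.asList("f4")))); // false
    }
}
```

Because the check runs symmetrically on every replace commit, it covers both orderings (clustering scheduled before or after the delete partition), unlike the one-way partition-level validation from the related issue.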
--
This message was sent by Atlassian Jira
(v8.20.10#820010)