[
https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ma Jian updated HUDI-6719:
--------------------------
Description:
Related issue: https://issues.apache.org/jira/browse/HUDI-5553
The specific problem is that when replace commit operations run concurrently,
two replace commits may point to the same file ID, resulting in a duplicate-key
error. The existing issue prevents scheduling a delete partition while there
are pending clustering or compaction operations. However, that fix is
incomplete and can still cause data inconsistency, because the validation is
one-way: a clustering plan can still be scheduled before the delete partition
is committed. In that case both replace commits still contain duplicate file
IDs, and the table becomes inconsistent once both plans are committed. This is
a critical problem, and there are other similar scenarios that can bypass the
validation added by the existing issue. Moreover, the existing check works at
the partition level, which is not precise enough.
Here is my solution:
!https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX!
As shown in the figure, both drop partition and clustering go through a window
of time in which they are not yet registered on the timeline, which is the
scenario the previous issue did not cover. Here, I register the replace file
IDs involved in each replace commit on the active timeline (completed replace
commits already persist partitionToReplaceFileIds, so only pending ones need
handling). Because in the Spark SQL case delete partition creates a requested
commit in advance of the write, which makes the requested instant inconvenient
to use, I save the pending replace commit's partitionToReplaceFileIds to the
inflight commit's extra metadata. Each time a drop partition or clustering is
executed, it then only needs to read the partitionToReplaceFileIds information
on the timeline, after ensuring its own inflight commit information has been
saved there, to verify that there are no duplicate file IDs and prevent this
kind of error from occurring.
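The validation step can be sketched as follows. This is a minimal illustration, not actual Hudi code: the class and method names are hypothetical, and the timeline is modeled simply as a list of partitionToReplaceFileIds maps gathered from both completed replace commits and the extra metadata of pending (inflight) ones.

```java
import java.util.*;

// Hypothetical sketch of the duplicate-file-ID guard. Each replace commit,
// whether completed or pending, contributes a partitionToReplaceFileIds map
// (partition path -> file IDs it replaces). Before a new replace plan
// (drop partition or clustering) commits, we check its file IDs against
// every map already registered on the timeline and reject on overlap.
public class ReplaceFileIdGuard {

    // Collect all file IDs already claimed by replace commits on the timeline.
    static Set<String> claimedFileIds(List<Map<String, List<String>>> timelineReplaceMetadata) {
        Set<String> claimed = new HashSet<>();
        for (Map<String, List<String>> partitionToReplaceFileIds : timelineReplaceMetadata) {
            for (List<String> fileIds : partitionToReplaceFileIds.values()) {
                claimed.addAll(fileIds);
            }
        }
        return claimed;
    }

    // Return the file IDs of the new plan that are already claimed by another
    // replace commit; an empty result means the plan is safe to commit.
    static Set<String> conflictingFileIds(Map<String, List<String>> newPlan,
                                          List<Map<String, List<String>>> timelineReplaceMetadata) {
        Set<String> claimed = claimedFileIds(timelineReplaceMetadata);
        Set<String> conflicts = new HashSet<>();
        for (List<String> fileIds : newPlan.values()) {
            for (String fileId : fileIds) {
                if (claimed.contains(fileId)) {
                    conflicts.add(fileId);
                }
            }
        }
        return conflicts;
    }
}
```

With this check, the race described above is caught: if a pending clustering has already registered file ID f2 and a concurrent drop partition also targets f2, the second commit sees the conflict and is prevented.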
In simple terms, every replace commit registers its replace file IDs on the
timeline whether or not it has completed, and every commit checks this
information before finishing to ensure no file ID is repeated. Any replace
commit containing an already-claimed file ID is prevented, guaranteeing that
there are no duplicate keys.
Once this idea is also applied to compaction commits, the change made for the
related issue can be removed.
> Fix data inconsistency issues caused by concurrent clustering and delete
> partition.
> -----------------------------------------------------------------------------------
>
> Key: HUDI-6719
> URL: https://issues.apache.org/jira/browse/HUDI-6719
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ma Jian
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)