[ 
https://issues.apache.org/jira/browse/HUDI-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3026:
-----------------------------
    Sprint: Cont' improve -  2022/02/07, Cont' improve -  2022/02/14, Cont' 
improve - 2022/03/01  (was: Cont' improve -  2022/02/07, Cont' improve -  
2022/02/14)

> HoodieAppendhandle may result in duplicate key for hbase index
> --------------------------------------------------------------
>
>                 Key: HUDI-3026
>                 URL: https://issues.apache.org/jira/browse/HUDI-3026
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: ZiyueGuan
>            Assignee: ZiyueGuan
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Problem: the same key may occur in two file groups when the HBase index is 
> used. These two file groups will have the same fileID prefix. Since the HBase 
> index is global, this is unexpected.
> How to repro:
> Start with a table whose records are not sorted in Spark. Say we have five 
> records with keys 1,2,3,4,5 to write; they may be iterated in a different 
> order in each task attempt. In task attempt 1, we write three records 5,4,3 
> to fileID_1_log.1_attempt1, but the attempt fails. Spark retries in a second 
> task attempt (attempt 2), which writes four records 1,2,3,4 to 
> fileID_1_log.1_attempt2. At that point the file group is reported large 
> enough by a call to canWrite, so Hudi writes record 5 to 
> fileID_2_log.1_attempt2 and finishes the commit.
> When compaction runs, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 
> are both compacted, so we end up with 5,4,3 + 1,2,3,4 = 1,2,3,4,5 in 
> fileID_1 while we also have 5 in fileID_2. Record 5 appears in two file 
> groups.
> Reason: the marker file mechanism does not reconcile log files, as the code 
> shows in 
> [https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553].
> So log files are not actually fail-safe.
> I'm not sure whether [~danny0405] has run into this problem too, as I find 
> FlinkAppendHandle had been made to always return true. But it was just 
> changed back recently.
> Solution:
> A quick fix is to make canWrite in HoodieAppendHandle always return true. 
> However, I think there may be a more elegant solution: use the append result 
> to generate the compaction plan rather than listing log files, which gives 
> more granular control at the log-block level instead of the log-file level.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
