ZiyueGuan created HUDI-3026:
-------------------------------
Summary: HoodieAppendHandle may result in duplicate keys for the HBase
index
Key: HUDI-3026
URL: https://issues.apache.org/jira/browse/HUDI-3026
Project: Apache Hudi
Issue Type: Bug
Reporter: ZiyueGuan
Problem: the same key may occur in two file groups, and those two file groups
will have the same fileID prefix.
How to reproduce:
Start from a table whose records are not sorted in Spark. Say we have records
1, 2, 3, 4, 5 to write; they may be iterated in a different order on each
attempt.
In task attempt 1 we write 5, 4, 3 to fileID_1_log.1_attempt1, but this
attempt fails. Spark retries with task attempt 2, in which we write 1, 2, 3, 4
to fileID_1_log.1_attempt2. At that point canWrite reports the file group is
large enough, so Hudi writes record 5 to fileID_2_log.1_attempt2 and finishes
the commit.
During compaction, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 are
both compacted, so fileID_1 ends up with 5, 4, 3 plus 1, 2, 3, 4 = 1, 2, 3, 4,
5, while fileID_2 also contains record 5. Record 5 thus appears in two file
groups.
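The steps above can be sketched as a minimal, self-contained simulation. All
names here (DuplicateKeyRepro, the map layout) are hypothetical and only model
the listing behavior described in this report; this is not Hudi code:

```java
import java.util.*;

// Simulates the failure mode: attempt 1 writes {5,4,3} to fileID_1 before
// failing, attempt 2 writes {1,2,3,4} to fileID_1 and rolls record 5 over to
// fileID_2. Compaction then groups log files by fileID prefix, so both
// attempts' files are merged into fileID_1.
public class DuplicateKeyRepro {
    public static void main(String[] args) {
        // Log files keyed by file name; values are the record keys they hold.
        Map<String, List<Integer>> logFiles = new LinkedHashMap<>();
        logFiles.put("fileID_1_log.1_attempt1", Arrays.asList(5, 4, 3)); // failed attempt, never cleaned up
        logFiles.put("fileID_1_log.1_attempt2", Arrays.asList(1, 2, 3, 4)); // retry
        logFiles.put("fileID_2_log.1_attempt2", Arrays.asList(5)); // canWrite rolled over

        // Compaction model: union records per fileID prefix.
        Map<String, Set<Integer>> fileGroups = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : logFiles.entrySet()) {
            String fileId = e.getKey().split("_log")[0]; // "fileID_1" or "fileID_2"
            fileGroups.computeIfAbsent(fileId, k -> new TreeSet<>()).addAll(e.getValue());
        }

        // Record 5 now appears under both file groups.
        System.out.println(fileGroups);
    }
}
```

Running it prints fileID_1 holding all five keys while fileID_2 still holds
record 5, which is exactly the duplicate the HBase index cannot tolerate.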
Reason: the marker file mechanism does not reconcile log files, as shown in
[https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553.]
So log files are not actually fail-safe.
I'm not sure whether [~danny0405] has found this problem too, as I noticed
FlinkAppendHandle had been made to always return true, but that was changed
back recently.
Solution:
A quick fix would be to make canWrite in HoodieAppendHandle always return
true. However, a more elegant solution may be to generate the compaction plan
from the append result rather than by listing log files, which would give more
granular control at the log-block level instead of the log-file level.
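For the quick fix, the idea is simply that a handle never rolls records over
to a second file group mid-write. A hypothetical sketch (the class and field
names below only loosely mirror HoodieAppendHandle; this is not the actual
Hudi class):

```java
// Sketch of the proposed quick fix: canWrite always returns true, so every
// record handed to this handle lands in the same file group, and a Spark task
// retry cannot split one batch across fileID_1 and fileID_2.
public class AppendHandleSketch {
    private final long maxFileSizeBytes;
    private long bytesWritten;

    public AppendHandleSketch(long maxFileSizeBytes) {
        this.maxFileSizeBytes = maxFileSizeBytes;
    }

    // Quick fix: unconditionally true.
    // The size-based rollover (bytesWritten < maxFileSizeBytes) is what lets a
    // retried attempt spill record 5 into a new file group.
    public boolean canWrite() {
        return true;
    }

    // Stand-in for appending a record; only tracks size for illustration.
    public void write(int recordKey) {
        bytesWritten += 8;
    }
}
```

The trade-off is that a single append handle can now exceed the configured max
file size; the compaction-plan-based approach avoids that by tracking the
append result per log block instead.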
--
This message was sent by Atlassian Jira
(v8.20.1#820001)