[
https://issues.apache.org/jira/browse/HUDI-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726947#comment-17726947
]
ZiyueGuan commented on HUDI-3026:
---------------------------------
This bug is fixed by HUDI-1517, which allows marker files for log files to use
the CREATE type. At the end of the commit (the reconcile process), the marker
files are compared against the commit write statuses, and any illegal log files
created during the write are deleted. This works the same way as marker files
for base files.
See
{code:java}
HoodieTable.reconcileAgainstMarkers{code}
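Conceptually, the reconcile step amounts to a set difference between the files that markers say were created and the files the commit metadata actually records. A minimal sketch of that idea (illustrative names only, not Hudi's actual API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ReconcileSketch {
    // Files that have a marker (i.e. were created during the write) but do not
    // appear in the final commit write statuses are "illegal" leftovers from
    // failed attempts, and the reconcile step deletes them.
    static Set<String> filesToDelete(Set<String> markerPaths, Set<String> committedPaths) {
        Set<String> invalid = new HashSet<>(markerPaths);
        invalid.removeAll(committedPaths); // keep only marker'd files with no commit record
        return invalid;
    }

    public static void main(String[] args) {
        // Attempt 1 failed and left fileID_1_log.1_attempt1 behind; attempt 2 committed.
        Set<String> markers = new HashSet<>(Arrays.asList(
                "fileID_1_log.1_attempt1",
                "fileID_1_log.1_attempt2",
                "fileID_2_log.1_attempt2"));
        Set<String> committed = new HashSet<>(Arrays.asList(
                "fileID_1_log.1_attempt2",
                "fileID_2_log.1_attempt2"));
        System.out.println(filesToDelete(markers, committed)); // [fileID_1_log.1_attempt1]
    }
}
```

With CREATE-type markers for log files, the stale attempt-1 log file is removed before compaction can ever merge it, which is what prevents the duplicate key described below.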
> HoodieAppendHandle may result in duplicate key for HBase index
> --------------------------------------------------------------
>
> Key: HUDI-3026
> URL: https://issues.apache.org/jira/browse/HUDI-3026
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: ZiyueGuan
> Assignee: ZiyueGuan
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Problem: the same key may occur in two file groups when the HBase index is
> used. These two file groups will have the same fileID prefix. As the HBase
> index is global, this is unexpected.
> How to reproduce:
> Start with a table whose records are not sorted in Spark. Say we have five
> records with keys 1,2,3,4,5 to write; they may be iterated in a different
> order on each attempt. In the first task attempt (attempt 1), we write three
> records 5,4,3 to fileID_1_log.1_attempt1, but this attempt fails. Spark
> retries with a second task attempt (attempt 2), which writes four records
> 1,2,3,4 to fileID_1_log.1_attempt2. At that point canWrite reports that this
> file group is large enough, so Hudi writes record 5 to
> fileID_2_log.1_attempt2 and finishes the commit.
> During compaction, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 are
> both compacted, so we end up with 5,4,3 + 1,2,3,4 = 1,2,3,4,5 in fileID_1
> while we also have 5 in fileID_2. Record 5 therefore appears in two file
> groups.
> Reason: marker files do not reconcile log files, as shown in the code at
> [https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553.]
> Log files are thus not actually fail-safe.
> I'm not sure whether [~danny0405] has run into this problem too, since I see
> that FlinkAppendHandle was once made to always return true, but it was
> changed back recently.
> Solution:
> A quick fix is to make canWrite in HoodieAppendHandle always return true.
> However, a more elegant solution may be to generate the compaction plan from
> the append results rather than by listing log files, which would give us more
> granular control at the log block level instead of the log file level.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)