hudi-bot opened a new issue, #14957:
URL: https://github.com/apache/hudi/issues/14957

   Problem: the same key may end up in two file groups when the HBase index is 
used. These two file groups share the same fileID prefix. Because the HBase 
index is global, a key should live in exactly one file group, so this is 
unexpected.
   
   How to repro:
   
   Start with a table whose records are not sorted in Spark. Say we have five 
records with keys 1, 2, 3, 4, 5 to write. They may be iterated in a different 
order on each attempt. 
   
   In the first task attempt (attempt 1), we write three records 5, 4, 3 to 
fileID_1_log.1_attempt1, but the attempt fails. Spark retries with a second 
task attempt (attempt 2), in which we write four records 1, 2, 3, 4 to 
fileID_1_log.1_attempt2. At that point canWrite reports that this file group 
is large enough, so Hudi writes record 5 to fileID_2_log.1_attempt2 and 
finishes the commit.
   
   During compaction, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 are 
both compacted. fileID_1 ends up with {5,4,3} + {1,2,3,4} = {1,2,3,4,5}, while 
fileID_2 also holds 5. Record 5 therefore appears in two file groups.
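
   The scenario above can be replayed with a toy model. This is only a sketch 
and not Hudi code: the class, the method names, the fixed limit of four records 
per file group (a stand-in for the size check behind canWrite), and the in-memory 
map standing in for storage are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy replay of the repro steps; every name here is hypothetical, not a Hudi API. */
class DuplicateKeySketch {
    // Assumption: a file group is "large enough" after 4 records in one attempt.
    static final int MAX_RECORDS_PER_GROUP = 4;

    /** Simulates one task attempt that dies after writing failAfter records. */
    static void runAttempt(Map<String, List<Integer>> storage, int attempt,
                           List<Integer> keys, int failAfter) {
        int group = 1;
        int writtenInGroup = 0;
        int total = 0;
        for (int key : keys) {
            if (total == failAfter) {
                return; // the attempt fails; its log file stays on storage
            }
            if (writtenInGroup == MAX_RECORDS_PER_GROUP) {
                group++; // roll over to a new file group
                writtenInGroup = 0;
            }
            String name = "fileID_" + group + "_log.1_attempt" + attempt;
            storage.computeIfAbsent(name, k -> new ArrayList<>()).add(key);
            writtenInGroup++;
            total++;
        }
    }

    /** Compaction by listing: merges every log file that shares a fileID prefix. */
    static Map<String, Set<Integer>> compactByListing(Map<String, List<Integer>> storage) {
        Map<String, Set<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : storage.entrySet()) {
            String fileId = e.getKey().substring(0, e.getKey().indexOf("_log"));
            groups.computeIfAbsent(fileId, k -> new TreeSet<>()).addAll(e.getValue());
        }
        return groups;
    }

    static Map<String, Set<Integer>> reproduce() {
        Map<String, List<Integer>> storage = new TreeMap<>();
        // Attempt 1 iterates 5,4,3,2,1 and fails after three records.
        runAttempt(storage, 1, Arrays.asList(5, 4, 3, 2, 1), 3);
        // Attempt 2 iterates 1,2,3,4,5 and succeeds.
        runAttempt(storage, 2, Arrays.asList(1, 2, 3, 4, 5), Integer.MAX_VALUE);
        return compactByListing(storage);
    }

    public static void main(String[] args) {
        // Prints {fileID_1=[1, 2, 3, 4, 5], fileID_2=[5]}
        System.out.println(reproduce());
    }
}
```

   Key 5 shows up under both fileID_1 and fileID_2 because the leftover log 
file from the failed attempt is still picked up when log files are listed.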
   
   Reason: marker files do not reconcile log files, as the code shows at 
https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553
   
   As a result, log files are not actually fail-safe.
   
   I'm not sure whether [~danny0405] has run into this problem too: I found 
that FlinkAppendHandle had been made to always return true, but it was changed 
back recently. 
   
   Solution:
   
   A quick fix is to make canWrite in HoodieAppendHandle always return true. 
However, I think a more elegant solution is to generate the compaction plan 
from the append results rather than by listing log files, which would give us 
control at the granularity of log blocks instead of log files. 
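
   To illustrate the quick fix, here is a minimal sketch of a write loop with 
and without rollover. It is not the HoodieAppendHandle implementation: the 
class, the alwaysCanWrite flag, and the maxPerGroup limit are hypothetical 
stand-ins for the real size check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy write loop; names and the record-count limit are assumptions, not Hudi APIs. */
class CanWriteSketch {
    /** Returns file group -> keys written by a single successful attempt. */
    static Map<String, List<Integer>> write(List<Integer> keys, boolean alwaysCanWrite,
                                            int maxPerGroup) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        int group = 1;
        for (int key : keys) {
            List<Integer> current =
                groups.computeIfAbsent("fileID_" + group, k -> new ArrayList<>());
            // Hypothetical stand-in for the canWrite size check.
            boolean canWrite = alwaysCanWrite || current.size() < maxPerGroup;
            if (!canWrite) {
                group++; // roll over to a new file group
                current = groups.computeIfAbsent("fileID_" + group, k -> new ArrayList<>());
            }
            current.add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(write(keys, false, 4)); // key 5 rolls over to fileID_2
        System.out.println(write(keys, true, 4));  // all keys stay in fileID_1
    }
}
```

   With the flag set, the retry never opens a second file group, so a leftover 
log file from a failed attempt can only contain keys that the successful 
attempt also wrote to the same file group.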
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3026
   - Type: Bug
   
   
   ---
   
   
   ## Comments
   
   27/Dec/21 09:47;danny0405;I made it return true because each Flink record is 
tagged with a fixed location before the write, so the write handle should never 
switch to a new file group.;;;
   
   ---
   
   06/Jan/22 03:46;guanziyue;Thanks for the kind explanation. I have little 
experience with Hudi on Flink. This problem may only occur with Spark.;;;
   
   ---
   
   28/May/23 16:07;guanziyue;This bug is fixed by HUDI-1517. HUDI-1517 allows 
marker files for log files to use the create type. At the end of the commit 
(the reconcile process), the marker files are compared with the commit's write 
statuses. Any invalid log files created during the write are deleted, the same 
way marker files work for base files.
   
   See HoodieTable.reconcileAgainstMarkers.;;;
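
   The reconcile step described in the last comment boils down to a set 
difference. The sketch below is an illustration only and not the actual 
HoodieTable.reconcileAgainstMarkers code: the class, the method, and the use of 
plain file-name sets for markers and write statuses are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy reconcile step; names are hypothetical and file names stand in for markers. */
class ReconcileSketch {
    /** Files that have a marker but are absent from the committed write statuses. */
    static Set<String> filesToDelete(Set<String> markedFiles, Set<String> committedFiles) {
        Set<String> invalid = new TreeSet<>(markedFiles);
        invalid.removeAll(committedFiles);
        return invalid;
    }

    public static void main(String[] args) {
        // Markers exist for every log file any attempt created.
        Set<String> marked = new HashSet<>(Arrays.asList(
            "fileID_1_log.1_attempt1",
            "fileID_1_log.1_attempt2",
            "fileID_2_log.1_attempt2"));
        // Only the successful attempt reports write statuses.
        Set<String> committed = new HashSet<>(Arrays.asList(
            "fileID_1_log.1_attempt2",
            "fileID_2_log.1_attempt2"));
        // Prints [fileID_1_log.1_attempt1]
        System.out.println(filesToDelete(marked, committed));
    }
}
```

   Once the leftover log file from attempt 1 is deleted, compaction of fileID_1 
only sees keys 1, 2, 3, 4, and key 5 lives in fileID_2 alone.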

