hudi-bot opened a new issue, #14957: URL: https://github.com/apache/hudi/issues/14957
Problem: The same key may occur in two file groups when the HBase index is used. The two file groups share the same fileID prefix. Because the HBase index is global, this is unexpected.

How to repro: Use a table whose records are not sorted in Spark. Say we have five records with keys 1, 2, 3, 4, 5 to write. They may be iterated in a different order on each attempt. In the first task attempt (attempt 1), we write three records 5, 4, 3 to fileID_1_log.1_attempt1, but this attempt fails. Spark retries in a second task attempt (attempt 2), which writes four records 1, 2, 3, 4 to fileID_1_log.1_attempt2. At that point canWrite reports that this file group is large enough, so Hudi writes record 5 to fileID_2_log.1_attempt2 and finishes the commit. When compaction runs, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 are both compacted. We end up with 5,4,3 + 1,2,3,4 = 1,2,3,4,5 in fileID_1, while fileID_2 also holds 5. Record 5 appears in two file groups.

Reason: Marker files do not reconcile log files, as the code shows in https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553, and log files are not actually fail-safe. I'm not sure whether [~danny0405] has run into this problem too, as I noticed FlinkAppendHandle had been made to always return true. But it was changed back recently.

Solution: A quick fix is to make canWrite in HoodieAppendHandle always return true. However, I think a more elegant solution would be to generate the compaction plan from the append results rather than by listing log files, which would give us more granular control at the level of log blocks instead of log files.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-3026
- Type: Bug

---

## Comments

27/Dec/21 09:47;danny0405;I made it return true because each Flink record is tagged with a fixed location before the write, so the write handle should never switch to a new file group.;;;

---

06/Jan/22 03:46;guanziyue;Thanks for your kind explanation.
I have little experience with Hudi on Flink. This problem may only occur with Spark.;;;

---

28/May/23 16:07;guanziyue;This bug is fixed by HUDI-1517. In HUDI-1517, marker files for log files are allowed to use the create type. At the end of the commit (the reconcile process), the marker files are compared with the commit write statuses. Any invalid log files created during the write are deleted, the same way marker files work for base files. See `HoodieTable.reconcileAgainstMarkers`.;;;
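
For readers who want to follow the repro step by step, here is a minimal sketch of the scenario as a toy Java model. It is not Hudi code: `DuplicateKeyRepro`, `LogFile`, `compact`, `duplicatedKeys`, and `reconcile` are hypothetical names made up for illustration. It also assumes that reconciliation can be approximated by keeping only the log files written by the committed task attempt, whereas the real fix compares marker files with the commit write statuses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Toy model of the scenario in this issue. None of these types are Hudi classes.
class DuplicateKeyRepro {

    // One log file written by one task attempt into one file group (hypothetical model).
    static final class LogFile {
        final String fileGroup;
        final int attempt;
        final List<Integer> keys;

        LogFile(String fileGroup, int attempt, Integer... keys) {
            this.fileGroup = fileGroup;
            this.attempt = attempt;
            this.keys = Arrays.asList(keys);
        }
    }

    // Log files left on storage after attempt 1 fails and attempt 2 commits.
    static List<LogFile> filesOnStorage() {
        return Arrays.asList(
            new LogFile("fileID_1", 1, 5, 4, 3),    // failed attempt, file never cleaned up
            new LogFile("fileID_1", 2, 1, 2, 3, 4), // canWrite turns false after these records
            new LogFile("fileID_2", 2, 5));         // writer rolled over to a new file group
    }

    // Compaction as described in the issue: merge every listed log file of a file group.
    static Map<String, Set<Integer>> compact(List<LogFile> files) {
        Map<String, Set<Integer>> groups = new TreeMap<>();
        for (LogFile f : files) {
            groups.computeIfAbsent(f.fileGroup, g -> new TreeSet<>()).addAll(f.keys);
        }
        return groups;
    }

    // Keys that end up in more than one file group after compaction.
    static Set<Integer> duplicatedKeys(Map<String, Set<Integer>> groups) {
        Set<Integer> seen = new TreeSet<>();
        Set<Integer> duplicates = new TreeSet<>();
        for (Set<Integer> keys : groups.values()) {
            for (Integer key : keys) {
                if (!seen.add(key)) {
                    duplicates.add(key);
                }
            }
        }
        return duplicates;
    }

    // Simplified stand-in for reconciliation: drop log files not written by the committed attempt.
    static List<LogFile> reconcile(List<LogFile> files, int committedAttempt) {
        return files.stream()
            .filter(f -> f.attempt == committedAttempt)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LogFile> files = filesOnStorage();
        System.out.println("without reconcile: " + duplicatedKeys(compact(files)));
        System.out.println("with reconcile: " + duplicatedKeys(compact(reconcile(files, 2))));
    }
}
```

In this model, compacting everything found on storage leaves key 5 in both fileID_1 and fileID_2, while dropping the log file from the failed attempt before compaction leaves no key in more than one file group.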
