jing created HUDI-2780:
--------------------------

             Summary: MOR log reader skips a complete block as a bad block, 
resulting in data loss
                 Key: HUDI-2780
                 URL: https://issues.apache.org/jira/browse/HUDI-2780
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: jing
         Attachments: image-2021-11-17-15-45-33-031.png, 
image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png

Debugging the data in the middle of the bad block shows that the lost records 
sit within the bad block's offset range. Because of the EOF-driven skip during 
reading, the compaction merge could not write them to the Parquet file at that 
time, even though the deltacommit for that instant succeeded. There are two 
consecutive HUDI magic markers in the middle of the bad block: reading the 
blocksize at the position right after the first magic actually reads the binary 
encoding of #HUDI# as 1227030528, so an EOF exception is raised because that 
size exceeds the file size.

!image-2021-11-17-15-45-33-031.png!

When detecting the position of the next block in order to skip the bad block, 
the scan should not start from the position after the blocksize was read, but 
from the position before the blocksize was read; otherwise the magic of the 
next block can itself be skipped.
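A minimal, self-contained sketch of the failure mode and the proposed fix. This is not Hudi's actual reader code; the magic constant, the `nextMagicOffset` helper, and the simulated log layout are all illustrative assumptions. It shows how resuming the corrupt-block scan after the 4 blocksize bytes can jump over a magic marker that starts immediately after the previous one, while rewinding to before the blocksize field still finds it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MagicScanSketch {
    // Hudi log blocks are delimited by the magic marker "#HUDI#"
    // (illustrative constant, mirroring the marker named in this issue).
    static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8);

    // Scan forward from 'from' for the next occurrence of MAGIC; -1 if none.
    static int nextMagicOffset(byte[] data, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - MAGIC.length; i++) {
            for (int j = 0; j < MAGIC.length; j++) {
                if (data[i + j] != MAGIC[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated corrupt tail: two consecutive magic markers with no
        // blocksize between them, then the blocksize of the following block.
        byte[] log = ByteBuffer.allocate(MAGIC.length * 2 + 4)
                .put(MAGIC)   // first magic (its block header was truncated)
                .put(MAGIC)   // second magic starts immediately at offset 6
                .putInt(42)   // blocksize of the next (valid) block
                .array();

        // The reader sits just after the first magic (offset 6) and reads the
        // next 4 bytes as the blocksize -- but they are actually "#HUD", a
        // bogus length far past EOF, which triggers the corrupt-block path.
        int bogusSize = ByteBuffer.wrap(log, 6, 4).getInt();

        // Buggy recovery: resume the magic scan AFTER the blocksize bytes
        // (offset 6 + 4): the second magic at offset 6 is jumped over and
        // its block's data is silently lost.
        int buggy = nextMagicOffset(log, 6 + 4);

        // Fixed recovery: rewind to BEFORE the blocksize field (offset 6),
        // so a magic starting right there is still found.
        int fixed = nextMagicOffset(log, 6);

        System.out.println("bogus blocksize = " + bogusSize
                + " (file is only " + log.length + " bytes)");
        System.out.println("buggy scan finds magic at " + buggy);  // -1: missed
        System.out.println("fixed scan finds magic at " + fixed);  // 6: found
    }
}
```

With this layout the buggy scan returns -1 and the reader skips to EOF, matching the observed data loss, while the fixed starting position recovers the adjacent block.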

!image-2021-11-17-15-46-04-313.png!

!image-2021-11-17-15-46-14-694.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
