jing created HUDI-2780:
--------------------------
Summary: MOR log reader skips a complete block as a corrupt
block, resulting in data loss
Key: HUDI-2780
URL: https://issues.apache.org/jira/browse/HUDI-2780
Project: Apache Hudi
Issue Type: Bug
Reporter: jing
Attachments: image-2021-11-17-15-45-33-031.png,
image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png
Debugging the data in the middle of the corrupt block shows that the lost
records fall within the corrupt block's offset range. Because the reader
skips the block on EOF, the compaction merge never writes those records to
the Parquet file, even though the deltacommit for that instant succeeded.
There are two consecutive HUDI magic markers in the middle of the corrupt
block. Reading the blocksize at the next position actually interprets the
bytes of #HUDI# as the integer 1227030528, which exceeds the file size, so
an EOF exception is thrown.
!image-2021-11-17-15-45-33-031.png!
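A minimal sketch of the arithmetic above. The exact bytes the reader landed on are an assumption, but interpreting the tail of one magic marker ("I#") followed by two zero bytes as a big-endian int reproduces exactly the 1227030528 value seen in the debugger:

```java
import java.nio.ByteBuffer;

public class MagicAsBlocksize {
    public static void main(String[] args) {
        // Hudi's log block magic is the 6 ASCII bytes "#HUDI#".
        // If the reader's 4-byte blocksize read lands on the last two bytes
        // of a magic ("I#" = 0x49 0x23) followed by two zero bytes, the
        // big-endian int it decodes is huge and exceeds the file size.
        byte[] misaligned = {(byte) 'I', (byte) '#', 0, 0};
        int bogusBlocksize = ByteBuffer.wrap(misaligned).getInt();
        System.out.println(bogusBlocksize); // 0x49230000 = 1227030528
    }
}
```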
When detecting the position of the next block in order to skip the corrupt
block, the scan should start from the position before the blocksize was
read, not from the position after it.
!image-2021-11-17-15-46-04-313.png!
!image-2021-11-17-15-46-14-694.png!
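A simplified, self-contained model of the rescan problem described above (the scan helper and byte layout are illustrative, not Hudi's actual reader code). With two consecutive magics, resuming the scan from the cursor position after the 4-byte blocksize read jumps past the second magic, so the whole block is dropped; resuming from the position before the blocksize read finds it:

```java
import java.util.Arrays;

public class RescanDemo {
    static final byte[] MAGIC = "#HUDI#".getBytes();

    // Linear scan for the next magic at or after 'from'; -1 if none.
    static int findMagic(byte[] buf, int from) {
        for (int i = Math.max(from, 0); i + MAGIC.length <= buf.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(buf, i, i + MAGIC.length), MAGIC)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Synthetic log: two consecutive magics (the case from the report),
        // then payload bytes. The second magic starts at offset 6.
        byte[] log = "#HUDI##HUDI#payload".getBytes();
        int corruptMagicPos = 0;
        int beforeBlocksize = corruptMagicPos + MAGIC.length;     // offset 6
        int afterBlocksize = beforeBlocksize + 4;                 // offset 10

        // Rescanning from after the blocksize read misses the second magic:
        System.out.println(findMagic(log, afterBlocksize));       // -1
        // Rescanning from before the blocksize read finds it:
        System.out.println(findMagic(log, beforeBlocksize));      // 6
    }
}
```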
--
This message was sent by Atlassian Jira
(v8.20.1#820001)