big-doudou commented on PR #9182: URL: https://github.com/apache/hudi/pull/9182#issuecomment-1649253338
The Flink Hudi sink here uses the bucket index. When a large amount of data arrives between checkpoints, part of it is flushed to HDFS before the checkpoint completes, and a file ID is generated at that point. If the TaskManager (TM) then restarts abnormally before the checkpoint completes, this code treats the restart as a Flink job partial failover and recovery, so Bootstrap() is not executed. As a result, the previously generated instant is reused and the old log file is not cleaned up, which leaves duplicate file IDs in the same bucket.
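
A toy simulation of the sequence described in this comment may help make the failure mode concrete. It is only a sketch: `ToyBucketWriter`, `bootstrap`, `flush`, and `simulate` are hypothetical names and not Hudi classes or APIs, and it assumes (as the comment implies) that the bootstrap path is what would otherwise clear out the file left behind by the failed attempt.

```python
from collections import defaultdict
from itertools import count

_seq = count(1)  # hypothetical source of unique file ID suffixes


class ToyBucketWriter:
    """Toy stand-in for a bucket-index write task (not a Hudi class)."""

    def __init__(self, storage):
        self.storage = storage       # bucket -> file IDs under the in-flight instant
        self.bucket_to_file_id = {}  # in-memory only, lost when the task restarts

    def bootstrap(self):
        """Assumed cleanup step: discard files left by the failed attempt."""
        self.storage.clear()

    def flush(self, bucket):
        """Write buffered records, creating a file ID if this task knows none."""
        file_id = self.bucket_to_file_id.get(bucket)
        if file_id is None:
            file_id = f"{bucket:08d}-{next(_seq)}"
            self.bucket_to_file_id[bucket] = file_id
            self.storage[bucket].append(file_id)
        return file_id


def simulate(run_bootstrap):
    """Return the file IDs that end up in bucket 0 after a crash and restart."""
    storage = defaultdict(list)

    # First attempt: a large batch is flushed before the checkpoint completes.
    ToyBucketWriter(storage).flush(0)

    # The TM restarts: in-memory state is gone, the flushed file is still there.
    restarted = ToyBucketWriter(storage)
    if run_bootstrap:
        restarted.bootstrap()

    # The same instant is reused and the restarted task flushes again.
    restarted.flush(0)
    return storage[0]
```

Under these assumptions, `simulate(run_bootstrap=False)` leaves two distinct file IDs in bucket 0, while `simulate(run_bootstrap=True)` leaves one.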
