shikai93 opened a new issue, #9303: URL: https://github.com/apache/hudi/issues/9303
Similar to https://github.com/apache/hudi/issues/5330, we are seeing the exception `Duplicate fileId - from bucket - of partition - found during the BucketStreamWriteFunction index bootstrap.` When we looked into the partition mentioned, we noticed two distinct fileIds that belong to the same bucket identifier:

```
2023-07-26 13:07:20  445861  00000011-a324-4cfb-b24c-a6c562c921b9_35-40-0_20230726050458695.parquet
2023-07-26 13:01:11  446459  00000011-ea07-47ed-94f2-a471c76d866f_15-20-0_20230726045906050.parquet
```

In our logs, we see that Hudi tried to load both fileIds into that bucket:

```
2023-07-27 13:55:14,994 | INFO | cket.BucketStreamWriteFunction | Should load this partition bucket 11 with fileId 00000011-a324-4cfb-b24c-a6c562c921b9
2023-07-27 13:55:14,994 | INFO | cket.BucketStreamWriteFunction | Should load this partition bucket 11 with fileId 00000011-ea07-47ed-94f2-a471c76d866f
```

Our Flink job had restarted in between writing this data, but we are confused why multiple fileIds were assigned to the same bucket and why the previous fileId was not rolled back/removed.

**Steps to reproduce the behavior**

1. Run a Flink job to ingest data
2. Crash the Flink job during ingestion

**Expected behavior**

Multiple fileIds should not exist for the same bucket number.

**Environment Description**

* Hudi version : 0.12.3

* Flink version : 1.15.3
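For context on why these two files collide: with Hudi's bucket index, the bucket number is encoded as the zero-padded first segment of the fileId (here `00000011-…` for both files, i.e. bucket 11), which is why the bootstrap sees a duplicate. Below is a minimal, hypothetical sketch (not Hudi's actual code; the helper names are made up) of how one could scan a partition listing for this condition:

```python
def bucket_id_of(file_name: str) -> int:
    # Bucket-index fileIds start with a zero-padded bucket number,
    # e.g. "00000011-a324-..._35-40-0_20230726050458695.parquet" -> 11.
    return int(file_name.split("-", 1)[0])


def find_duplicate_buckets(file_names):
    # Map each bucket id to the set of distinct fileIds
    # (the token before the first '_') claiming that bucket.
    buckets = {}
    for name in file_names:
        file_id = name.split("_", 1)[0]
        buckets.setdefault(bucket_id_of(name), set()).add(file_id)
    # Keep only buckets that have more than one fileId -- the error case.
    return {b: ids for b, ids in buckets.items() if len(ids) > 1}


listing = [
    "00000011-a324-4cfb-b24c-a6c562c921b9_35-40-0_20230726050458695.parquet",
    "00000011-ea07-47ed-94f2-a471c76d866f_15-20-0_20230726045906050.parquet",
]
print(find_duplicate_buckets(listing))  # bucket 11 has two fileIds
```

Running this over the partition above reports bucket 11 with both fileIds, matching the two "Should load this partition bucket 11" log lines.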
