shikai93 opened a new issue, #9303:
URL: https://github.com/apache/hudi/issues/9303

   Similar to https://github.com/apache/hudi/issues/5330, we are seeing this exception:
   `Duplicate fileId - from bucket - of partition - found during the BucketStreamWriteFunction index bootstrap.`
   When we looked into the partition mentioned, we noticed two distinct fileIds that belong to the same bucket identifier:

   ```
   2023-07-26 13:07:20    445861  00000011-a324-4cfb-b24c-a6c562c921b9_35-40-0_20230726050458695.parquet
   2023-07-26 13:01:11    446459  00000011-ea07-47ed-94f2-a471c76d866f_15-20-0_20230726045906050.parquet
   ```
   
   In our logs, we see that Hudi tried to load both fileIds into that bucket:

   ```
   2023-07-27 13:55:14,994 | INFO  | cket.BucketStreamWriteFunction  | Should load this partition bucket 11 with fileId 00000011-a324-4cfb-b24c-a6c562c921b9
   2023-07-27 13:55:14,994 | INFO  | cket.BucketStreamWriteFunction  | Should load this partition bucket 11 with fileId 00000011-ea07-47ed-94f2-a471c76d866f
   ```
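   As a side note on why both files collide: with Hudi's bucket index, the bucket number is encoded as a zero-padded 8-digit prefix of the fileId. The following sketch (a simplified stand-in, not Hudi's actual `BucketIdentifier` code) mimics that parsing to show why both fileIds above resolve to bucket 11:

   ```python
   # Simplified sketch of bucket-id parsing under the bucket index:
   # the bucket number is the integer value of the fileId's first 8 characters.
   def bucket_id_from_file_id(file_id: str) -> int:
       """Parse the bucket number from the zero-padded 8-char fileId prefix."""
       return int(file_id[:8])

   for fid in (
       "00000011-a324-4cfb-b24c-a6c562c921b9",
       "00000011-ea07-47ed-94f2-a471c76d866f",
   ):
       print(fid, "-> bucket", bucket_id_from_file_id(fid))
   # Both fileIds parse to bucket 11, which is what triggers the
   # duplicate-fileId check during index bootstrap.
   ```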
   
   Our Flink job had restarted in between these writes; however, we are confused about why multiple fileIds were assigned to the same bucket, and why the previous fileId was not rolled back or removed.
   
   
   **Steps to reproduce the behavior:**
   
   1. Run a Flink job to ingest data.
   2. Crash the Flink job during ingestion.
   
   **Expected behavior**
   
   Multiple fileIds should not exist for the same bucket number.
   
   **Environment Description**
   
   * Hudi version : 0.12.3
   
   * Flink version : 1.15.3
   
   

