ergunbaris commented on issue #11718: URL: https://github.com/apache/hudi/issues/11718#issuecomment-2265252110
@KnightChess Thanks for the question. I have looked into the production code and the hoodie directory, and the timeline is as follows:

- The replay fix (INSERT_OVERWRITE) ran between 2024-04-24 15:48:59.886044+00:00 and 2024-04-27 01:45:39.598175+00:00.
- The Hudi cleaner ran at 20240501084821521 (this is from the hudi-cli `cleans show` output; I guess it is UTC?) for 44 minutes and deleted all the unreferenced parquet files, except those in the problematic date partitions.
- All the replacecommits were archived on May 1, 2024, 11:33:47 (UTC+01:00), which is May 1, 2024, 10:33:47 (UTC+00:00).

So basically all the replacecommits were archived after the cleaner ran. On the other hand, the first replacecommits to be archived would always be those for the oldest date partitions. But in our case the oldest date partitions were processed successfully, and the date partitions belonging to the batches towards the end of the process were the ones that got duplicated!
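For anyone who wants to double-check the ordering, here is a minimal sketch of the timestamp arithmetic. It assumes the cleaner instant is in UTC (which is only my guess above) and that the 44 minutes count from the instant time; `parse_instant` is a hypothetical helper, not a Hudi API.

```python
from datetime import datetime, timedelta, timezone


def parse_instant(instant: str, tz=timezone.utc) -> datetime:
    """Parse a 17-digit instant time laid out as yyyyMMddHHmmssSSS.

    The timezone is an assumption; pass a different tz if the table
    does not write its instants in UTC.
    """
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    millis = int(instant[14:17])
    return base.replace(microsecond=millis * 1000, tzinfo=tz)


# Cleaner instant from the `cleans show` output, assumed to be UTC.
clean_start = parse_instant("20240501084821521")
# The cleaner ran for roughly 44 minutes.
clean_end = clean_start + timedelta(minutes=44)

# Archive time as observed locally (UTC+01:00), normalized to UTC.
archive_local = datetime(2024, 5, 1, 11, 33, 47,
                         tzinfo=timezone(timedelta(hours=1)))
archive_utc = archive_local.astimezone(timezone.utc)

# Positive gap means archival happened after the cleaner finished.
gap = archive_utc - clean_end
print("cleaner finished:", clean_end.isoformat())
print("archived at:     ", archive_utc.isoformat())
print("archive after clean:", gap > timedelta(0))
```

If the instant turns out to be in local time (UTC+01:00) rather than UTC, passing `tz=timezone(timedelta(hours=1))` to `parse_instant` shifts the cleaner an hour earlier, so the archival would still come after it.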
