TJX2014 opened a new pull request, #6595: URL: https://github.com/apache/hudi/pull/6595
…th spark lead to duplicate bucket issue

### Change Logs

Make hudi-flink also generate a CreateHandle for a MOR table when the base file of a bucket does not exist. Enable deduplication for MOR tables.

### Impact

The duplicate issue comes from hudi-flink writing a MOR table: it first appends to log files, and until compaction runs, the bucket number does not appear in any base file. When Spark calls `loadPartitionBucketIdFileIdMapping` of `org.apache.hudi.index.bucket.HoodieSimpleBucketIndex`, it does not find the bucket number written by hudi-flink, so it generates a new one that is not consistent with hudi-flink.

After this change, when hudi-flink writes a MOR table with the bucket index, it first tries to write a base parquet file after deduplication; if the base file already exists, it writes to a log file instead. Following the Spark approach seems more stable for MOR tables.

**Risk level: none | low | medium | high**

None.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
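The mismatch described under Impact can be illustrated with a small standalone model. This is only a sketch, not Hudi code: `BucketMappingSketch`, `FileGroup`, `loadMapping`, and `resolve` are hypothetical names, and it assumes that the bucket number is encoded as the first 8 digits of the file ID and that "a new one" means a new file ID for the same bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BucketMappingSketch {

    /** Hypothetical model of a file group in one partition (not a Hudi class). */
    static final class FileGroup {
        final String fileId;
        final boolean hasBaseFile;

        FileGroup(String fileId, boolean hasBaseFile) {
            this.fileId = fileId;
            this.hasBaseFile = hasBaseFile;
        }
    }

    /** Assumption: the bucket number is the first 8 digits of the file ID. */
    static int bucketOf(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    /** Builds bucket -> fileId; optionally skips file groups that only have log files. */
    static Map<Integer, String> loadMapping(List<FileGroup> groups, boolean baseFilesOnly) {
        Map<Integer, String> mapping = new HashMap<>();
        for (FileGroup g : groups) {
            if (baseFilesOnly && !g.hasBaseFile) {
                continue; // log-only file group is invisible to this reader
            }
            mapping.put(bucketOf(g.fileId), g.fileId);
        }
        return mapping;
    }

    /** Returns the file ID a writer targets for the bucket, creating one if unmapped. */
    static String resolve(Map<Integer, String> mapping, int bucket, String newSuffix) {
        return mapping.computeIfAbsent(bucket, b -> String.format("%08d-%s", b, newSuffix));
    }

    public static void main(String[] args) {
        List<FileGroup> partition = new ArrayList<>();
        partition.add(new FileGroup("00000000-flink", true));
        partition.add(new FileGroup("00000001-flink", false)); // appended log, not yet compacted

        // A reader that only looks at base files creates a second file ID for bucket 1.
        System.out.println(resolve(loadMapping(partition, true), 1, "spark"));
        // A reader that also sees log-only file groups reuses the existing file ID.
        System.out.println(resolve(loadMapping(partition, false), 1, "spark"));
    }
}
```

In this model the first lookup yields a second file ID for bucket 1, which is the duplicate the PR describes. If the writer creates a base file first, both readers resolve the bucket to the same file ID.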
