TJX2014 opened a new pull request, #6595:
URL: https://github.com/apache/hudi/pull/6595

   …th spark lead to duplicate bucket issue
   
   ### Change Logs
   Make hudi-flink generate a CreateHandle for a MOR table when the base file 
for the bucket does not exist.
   Enable the deduplication function for MOR tables.
   
   ### Impact
   The duplicate issue comes from hudi-flink MOR tables: hudi-flink first 
appends to a log file and does not compact right away, so the bucket number is 
not present in any base file.
   When Spark calls loadPartitionBucketIdFileIdMapping of 
org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it does not find the 
bucket number written by hudi-flink, so it generates a new one that is not 
consistent with hudi-flink.
   After this change, when hudi-flink writes a MOR table with the bucket index, 
it first writes a base parquet file after deduplication; if the base file 
already exists, it writes to a log file instead. Following the Spark approach 
seems more stable for MOR tables.
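
   The scenario above can be modelled with a small sketch. This is a simplified illustration, not Hudi code: `new_file_id`, `bucket_id_of`, `load_bucket_mapping` and `choose_handle` are hypothetical helpers, and the file ID layout (a zero-padded bucket number followed by a UUID) is an assumption made for the example. The one behaviour taken from the description is that the bucket-to-file-ID mapping is built only from file groups that already have a base file.

```python
import uuid


def new_file_id(bucket_id):
    # Assumed layout for illustration: 8-digit bucket number, then a random UUID.
    return f"{bucket_id:08d}-{uuid.uuid4()}"


def bucket_id_of(file_id):
    # Recover the bucket number from the assumed file ID prefix.
    return int(file_id[:8])


def load_bucket_mapping(base_file_ids):
    # Hypothetical stand-in for the mapping step: it only sees
    # file groups that already have a base file.
    return {bucket_id_of(fid): fid for fid in base_file_ids}


def choose_handle(mapping, bucket_id):
    # Reuse the existing file group if the bucket is known, else create one.
    if bucket_id in mapping:
        return "append", mapping[bucket_id]
    return "create", new_file_id(bucket_id)


# Before the change: the first writer only appended a log file for bucket 3,
# so no base file exists and the second writer cannot see the file group.
flink_log_only = new_file_id(3)
spark_view = load_bucket_mapping([])
_, spark_file_id = choose_handle(spark_view, 3)
duplicate_before = spark_file_id != flink_log_only  # two file groups, one bucket

# After the change: the first write creates a base file, so the mapping
# built by the second writer contains bucket 3 and the file group is reused.
kind, flink_base = choose_handle({}, 3)
spark_view = load_bucket_mapping([flink_base])
kind_after, spark_file_id_after = choose_handle(spark_view, 3)
```

   In this model the duplicate disappears because the first write for a bucket always leaves a base file behind, which is the only thing the mapping step inspects.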
   
   **Risk level: none | low | medium | high**
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
