xushiyan opened a new pull request, #8490: URL: https://github.com/apache/hudi/pull/8490
### Change Logs

When a global index (bloom or simple) is used with update-partition set to true, duplicates can be introduced. A record may start in p1 and later be updated to p2. If it is then updated to p3 before compaction has happened, the global index joins against both old versions of the record, in p1 and p2, and tags two records for insert into p3. These duplicates remain in the dataset and are not reconciled unless the table is manually deduplicated.

When records are inserted into new partitions, the existing logic does not honor custom payloads; this should be handled by the record merger API.

### Impact

The global index will load file slices to perform merging and tagging, which slows down the whole process when many partition updates happen.

### Risk level

High.

- [ ] End to end testing and UT.

### Documentation Update

- [ ] New config `hoodie.global.index.reconcile.parallelism`

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
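
### Illustration of the duplicate scenario

The duplicate tagging described in the change log can be sketched with a toy model. This is not Hudi code: the record layout, the `ts` ordering field, and the functions `tag_naive` and `tag_reconciled` are all hypothetical names chosen for illustration. The sketch assumes that reconciling means keeping only the latest existing version per key before tagging.

```python
# Toy model only: plain dicts stand in for index entries; none of this is Hudi API.

def tag_naive(incoming, existing):
    """Tag each incoming record once per existing version that shares its key."""
    tagged = []
    for rec in incoming:
        matches = [old for old in existing if old["key"] == rec["key"]]
        if not matches:
            # Key not seen anywhere: a plain insert.
            tagged.append(("insert", rec["partition"], rec["key"]))
        for old in matches:
            if old["partition"] == rec["partition"]:
                tagged.append(("update", rec["partition"], rec["key"]))
            else:
                # Partition changed: one insert per stale match, hence duplicates.
                tagged.append(("insert", rec["partition"], rec["key"]))
    return tagged


def tag_reconciled(incoming, existing):
    """Keep only the latest existing version per key (by the assumed `ts`), then tag."""
    latest = {}
    for old in existing:
        current = latest.get(old["key"])
        if current is None or old["ts"] > current["ts"]:
            latest[old["key"]] = old
    return tag_naive(incoming, list(latest.values()))


# Record k1 was written to p1, then moved to p2; compaction has not run yet,
# so both old versions are still visible to the lookup.
existing = [
    {"key": "k1", "partition": "p1", "ts": 1},
    {"key": "k1", "partition": "p2", "ts": 2},
]
incoming = [{"key": "k1", "partition": "p3", "ts": 3}]

naive_result = tag_naive(incoming, existing)
reconciled_result = tag_reconciled(incoming, existing)
print(naive_result)
print(reconciled_result)
```

In this model the naive join yields two inserts into p3 for the same key, while reconciling first yields one. The extra pass over existing versions loosely mirrors the cost noted under Impact.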
