xushiyan opened a new pull request, #8490:
URL: https://github.com/apache/hudi/pull/8490

   ### Change Logs
   
   When a global index (bloom or simple) is used and partition path update is 
set to true, a record can start in p1 and later be updated to p2. If it is 
then updated to p3 before compaction has happened, the global index joins 
against both old versions of the record, in p1 and p2, and tags 2 records to 
insert into p3. These duplicates stay in the dataset and are not reconciled 
unless the table is manually deduplicated.
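
   The duplicate scenario above can be illustrated with a toy in-memory model. 
This is only a sketch under simplifying assumptions, not Hudi's implementation: 
`base_files`, `log_files`, and every function name are hypothetical, and the 
move from p1 to p2 is modeled as a delete in p1's log plus an insert into p2's 
base file. Looking up base files alone finds the key in both p1 and p2, while 
merging each file slice (base plus log) first finds it only in p2.

```python
# Toy model of a merge-on-read table; all names here are hypothetical.
# k1 was written to p1, then moved to p2. Compaction has not run, so
# p1's base file still holds k1 and the delete only exists in p1's log.
base_files = {"p1": {"k1": "v1"}, "p2": {"k1": "v2"}, "p3": {}}
log_files = {"p1": [("delete", "k1", None)], "p2": [], "p3": []}


def lookup_base_only(key):
    """Find partitions holding the key, ignoring uncompacted log files."""
    return [p for p, base in base_files.items() if key in base]


def merged_slice(partition):
    """Apply a partition's log entries on top of its base file."""
    merged = dict(base_files[partition])
    for op, key, value in log_files[partition]:
        if op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


def lookup_merged(key):
    """Find partitions holding the key after merging base and log."""
    return [p for p in base_files if key in merged_slice(p)]


def tag_partition_update(key, new_partition, lookup):
    """For each old location found, tag a delete there and an insert
    into the new partition."""
    ops = []
    for old in lookup(key):
        if old != new_partition:
            ops.append(("delete", old, key))
            ops.append(("insert", new_partition, key))
    return ops


naive = tag_partition_update("k1", "p3", lookup_base_only)
fixed = tag_partition_update("k1", "p3", lookup_merged)
naive_inserts = [op for op in naive if op[0] == "insert"]
fixed_inserts = [op for op in fixed if op[0] == "insert"]
print(len(naive_inserts), len(fixed_inserts))
```

   In this model the base-only lookup tags two inserts into p3 and the merged 
lookup tags one. The extra merge step is also where the added cost comes from, 
since each candidate file slice has to be read before tagging.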
   
   When records are inserted into new partitions, the existing logic does not 
honor a custom payload; this should be handled by the record merger API.
   
   ### Impact
   
   The global index will load the file slice to perform merging and tagging, 
which slows down the whole process when many partition updates happen.
   
   ### Risk level
   
   High.
   
   - [ ] End to end testing and UT.
   
   ### Documentation Update
   
   - [ ] New config `hoodie.global.index.reconcile.parallelism`
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

