ChiehFu commented on issue #10914:
URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065368620

   In addition, I found some duplicates written by my bulk_insert batch job 
(job 1) and my upsert stream job (job 2, the one that had index bootstrap 
enabled).
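
For reference, a minimal sketch of how duplicates like these could be counted once rows have been read out of the table. It assumes each row is a dict carrying the Hudi meta columns `_hoodie_partition_path` and `_hoodie_record_key`; the helper name `find_duplicate_keys` and the sample rows are hypothetical and are not the check that was actually run.

```python
from collections import Counter


def find_duplicate_keys(rows):
    """Return {(partition_path, record_key): count} for keys seen more than once."""
    counts = Counter(
        (row["_hoodie_partition_path"], row["_hoodie_record_key"]) for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample rows, for illustration only.
sample_rows = [
    {"_hoodie_partition_path": "2024-04-01", "_hoodie_record_key": "id:1"},
    {"_hoodie_partition_path": "2024-04-01", "_hoodie_record_key": "id:1"},
    {"_hoodie_partition_path": "2024-04-01", "_hoodie_record_key": "id:2"},
]
duplicates = find_duplicate_keys(sample_rows)
```

Grouping on the partition path together with the record key separates duplicates within one partition from the same key appearing in different partitions, which are different situations to debug.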
   
   The bulk_insert batch job had `write.precombine` set to `true`, so shouldn't 
the result table be free of duplicates?  
   
   The upsert stream job had `write.precombine` set to `true`, and its index 
bootstrap task had parallelism set to `480`. I found this previous issue, 
https://github.com/apache/hudi/issues/4881, which suggests duplicates can happen 
when the index bootstrap task parallelism is > 1. Is that still the case in Hudi 
0.14.1? The table that needs to be index bootstrapped is large, so I am not sure 
whether setting parallelism to `1` would work.
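
To make the two setups easier to compare, here is a sketch of the settings described above written as option maps. Only `write.precombine` and the values `true` and `480` come from the comment itself. The other keys (`write.operation`, `index.bootstrap.enabled`, `write.index_bootstrap.tasks`) are assumed to be the Hudi Flink option names behind "bulk_insert", "upsert", "index bootstrap enabled" and "index bootstrap task parallelism", and the helper `to_with_clause` is hypothetical.

```python
# Assumed option names; only write.precombine and the values are quoted
# directly from the comment above.
bulk_insert_job_options = {
    "write.operation": "bulk_insert",
    "write.precombine": "true",
}

upsert_stream_job_options = {
    "write.operation": "upsert",
    "write.precombine": "true",
    "index.bootstrap.enabled": "true",
    "write.index_bootstrap.tasks": "480",
}


def to_with_clause(options):
    """Render an option map as the body of a SQL-style WITH (...) clause."""
    return ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())


with_clause = to_with_clause(upsert_stream_job_options)
```

Rendering the map as `'key' = 'value'` pairs makes it easy to paste the exact settings into a reply when comparing against a setup that does not produce duplicates.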
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

