ChiehFu commented on issue #10914: URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065368620
In addition, I found duplicates written by both my bulk_insert batch job (job 1) and my upsert streaming job (job 2, the one with index bootstrap enabled). The bulk_insert batch job had `write.precombine` set to `true`, so shouldn't the resulting table be free of duplicates? The upsert streaming job also had `write.precombine` set to `true`, and its index bootstrap task had parallelism set to `480`. I found an earlier issue, https://github.com/apache/hudi/issues/4881, which suggests that duplicates can occur when the index bootstrap task parallelism is greater than 1. Is that still the case in Hudi 0.14.1? The table that needs to be index bootstrapped is large, so I am not sure that setting the parallelism to `1` would work.
