ad1happy2go commented on issue #8532: URL: https://github.com/apache/hudi/issues/8532#issuecomment-1573386158
Please find responses to your queries below.

**How are hoodie.parquet.max.file.size and shuffle.parallelism related?**
If the operation ends up updating too many file groups, you should set a higher shuffle parallelism so that the writes can be parallelized across those file groups.

**What causes the shuffling of data?**
During bulk insert, Hudi shuffles and sorts the data in order to produce a file layout that gives good read performance. You can disable this behavior with the flags `write.bulk_insert.sort_input` and `write.bulk_insert.shuffle_input`, which should speed up the write.

`hoodie.bulkinsert.shuffle.parallelism` should be sized directly according to the number of file groups being written.
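To make the configs above concrete, here is a minimal sketch of a write-options map combining them. The table name, path, and parallelism value are hypothetical placeholders; the config keys are the ones named in this comment (note that `write.bulk_insert.sort_input` / `write.bulk_insert.shuffle_input` are the Flink-writer option names as quoted above):

```python
# Hypothetical Hudi bulk-insert options illustrating the configs discussed.
hudi_options = {
    "hoodie.table.name": "my_table",  # placeholder table name
    "hoodie.datasource.write.operation": "bulk_insert",
    # Cap the size of each written parquet base file, in bytes.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Size this to roughly the number of file groups being written;
    # 200 here is an arbitrary example value.
    "hoodie.bulkinsert.shuffle.parallelism": "200",
    # Disable the sort/shuffle step to speed up bulk insert, trading
    # away the read-optimized layout it would produce.
    "write.bulk_insert.sort_input": "false",
    "write.bulk_insert.shuffle_input": "false",
}

# Typical usage with the Spark datasource (sketch, not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The trade-off is the one described above: leaving sort/shuffle enabled costs write time but clusters the data for better reads, while disabling both flags favors ingest speed.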
