ziudu commented on issue #11565:
URL: https://github.com/apache/hudi/issues/11565#issuecomment-2208621681
It seems that the bigger the table, the worse the data skew. I noticed this
issue when joining two tables and writing the result to a Hudi table:
- if the result table is 5 GB, only 37 of 320 tasks have a "Shuffle
Read Size" larger than 0, ranging from 822 MB down to 39.2 MB (num_bucket=2,
spark.sql.shuffle.partitions = 320)
- if the result table is 20 GB, only 60 of 320 tasks have a "Shuffle
Read Size" larger than 0, ranging from 1.8 GB down to 102 MB (num_bucket=5,
spark.sql.shuffle.partitions = 320)
So I ran a simple read-table / write-table test.
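A minimal sketch of why so few of the 320 tasks see any shuffle data, assuming the bucket-index write groups records by a (table partition, bucket id) key before shuffling. The partition count (30) and the use of Python's `hash` as a stand-in for Spark's hash partitioner are assumptions for illustration, not Hudi's actual implementation:

```python
# Assumption: with a bucket index, records shuffle by (partition_path, bucket_id),
# so the distinct-key count, not spark.sql.shuffle.partitions, caps parallelism.
num_shuffle_partitions = 320   # spark.sql.shuffle.partitions from the report
num_buckets = 2                # num_bucket from the 5 GB case
table_partitions = [f"dt=2024-01-{d:02d}" for d in range(1, 31)]  # assumed 30 partitions

# Every record maps to one of these composite keys.
shuffle_keys = {(p, b) for p in table_partitions for b in range(num_buckets)}

# Each distinct key lands in exactly one shuffle partition, so at most
# len(shuffle_keys) of the 320 tasks can have a nonzero "Shuffle Read Size";
# hash collisions make the real count even smaller.
non_empty = {hash(k) % num_shuffle_partitions for k in shuffle_keys}
print(f"{len(non_empty)} of {num_shuffle_partitions} tasks receive data "
      f"(distinct shuffle keys: {len(shuffle_keys)})")
```

With 30 partitions and num_bucket=2 there are only 60 distinct keys, which is the same order of magnitude as the 37 non-empty tasks observed above; the remaining ~283 tasks have nothing to read.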
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]