ziudu commented on issue #11565:
URL: https://github.com/apache/hudi/issues/11565#issuecomment-2208621681
It seems that the bigger the table, the worse the data skew. I noticed this
issue when joining two tables and writing the result to a Hudi table:
- if the result table is 5 GB, only 37 of 320 tasks have a "Shuffle
Read Size" larger than 0, ranging from 822 MB down to 39.2 MB (num_bucket=2,
spark.sql.shuffle.partitions = 320)
- if the result table is 20 GB, only 60 of 320 tasks have a "Shuffle
Read Size" larger than 0, ranging from 1.8 GB down to 102 MB (num_bucket=5,
spark.sql.shuffle.partitions = 320)
So I ran a simple read-table / write-table test.
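A minimal sketch of why so few of the 320 tasks see any shuffle data, assuming the bucket-index write groups records by a (table partition, bucket id) key before shuffling. The partition count (30) and the use of Python's `hash` as a stand-in for Spark's hash partitioner are assumptions for illustration, not Hudi's actual implementation:

```python
# Assumption: with a bucket index, records shuffle by (partition_path, bucket_id),
# so the distinct-key count, not spark.sql.shuffle.partitions, caps parallelism.
num_shuffle_partitions = 320   # spark.sql.shuffle.partitions from the report
num_buckets = 2                # num_bucket from the 5 GB case
table_partitions = [f"dt=2024-01-{d:02d}" for d in range(1, 31)]  # assumed 30 partitions

# Every record maps to one of these composite keys.
shuffle_keys = {(p, b) for p in table_partitions for b in range(num_buckets)}

# Each distinct key lands in exactly one shuffle partition, so at most
# len(shuffle_keys) of the 320 tasks can have a nonzero "Shuffle Read Size";
# hash collisions make the real count even smaller.
non_empty = {hash(k) % num_shuffle_partitions for k in shuffle_keys}
print(f"{len(non_empty)} of {num_shuffle_partitions} tasks receive data "
      f"(distinct shuffle keys: {len(shuffle_keys)})")
```

With 30 partitions and num_bucket=2 there are only 60 distinct keys, which is the same order of magnitude as the 37 non-empty tasks observed above; the remaining ~283 tasks have nothing to read.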
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]