Github user scwf commented on the pull request:
https://github.com/apache/spark/pull/3694#issuecomment-76318262
Sorry for the delay; my initial idea here is:
1. We can set spark.default.parallelism to control the number of shuffle
partitions, but this config option is not sensitive to the RDD's data size:
for a job with 1TB of input data the partition count is x, and for the same
job with 1KB of input data the partition count is also x.
2. If we do not set spark.default.parallelism, a Spark RDD uses its parent
RDD's partition count as its own. But in this case I found there may be
many tiny tasks, because the parent RDD's partition count is large. So I
think maybe we can introduce a ratio to control the shuffle partition
count; see the sketch below.
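For illustration, a minimal Scala sketch of the two behaviors and of the
ratio idea. The `ratio` knob is hypothetical, standing in for the proposal
rather than an existing Spark config; the input path and partition counts
are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-demo"))

// (1) If spark.default.parallelism is set (e.g. --conf
// spark.default.parallelism=200), every shuffle produces 200 partitions,
// whether the job reads 1TB or 1KB of input.

// (2) If it is unset, reduceByKey inherits the parent RDD's partition
// count, which yields many tiny tasks when the parent is heavily
// partitioned but the shuffled data is small.
val parent = sc.textFile("hdfs:///data/input")            // e.g. 10000 partitions
val counts = parent.map(w => (w, 1L)).reduceByKey(_ + _)  // also 10000 partitions

// The ratio idea: derive the shuffle partition count from the parent,
// scaled down by a factor. `ratio` is hypothetical, not a real config.
val ratio = 0.1
val numShufflePartitions = math.max(1, (parent.partitions.length * ratio).toInt)
val scaled = parent.map(w => (w, 1L)).reduceByKey(_ + _, numShufflePartitions)
```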
OK, I am closing this.