c21 commented on PR #36733: URL: https://github.com/apache/spark/pull/36733#issuecomment-1143188178
@manuzhang - from my understanding, you want to introduce a feature that enforces the number of Spark tasks to be the same as the number of table buckets, even when the query does not read the bucket column(s). I agree with @cloud-fan in https://github.com/apache/spark/pull/27924#issuecomment-1139340835 that controlling the number of Spark tasks should not be a design goal for bucketed tables. If you really want to control the number of tasks, you can either tune `spark.sql.files.maxPartitionBytes` or add an extra shuffle via `repartition()`/`DISTRIBUTE BY`. I understand your concern per https://github.com/apache/spark/pull/27924#issuecomment-1139360593, but I am afraid we would be introducing a feature here that is not actually needed by many other Spark users. To be honest, the requested feature does not seem popular based on my experience. My two cents: it might help to post on the Spark dev mailing list to gather more feedback on whether developers/users indeed have a similar requirement.
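For reference, the two alternatives mentioned above could look roughly like this. This is a sketch that assumes an existing SparkSession `spark`; the table name `sales` and column `user_id` are hypothetical:

```scala
// Option 1: lower the per-partition scan target so the file scan
// produces more (smaller) tasks. 64m is an illustrative value.
spark.conf.set("spark.sql.files.maxPartitionBytes", "64m")

// Option 2: add an explicit shuffle to pick the task count directly.
val df = spark.table("sales").repartition(200)

// SQL equivalent of the explicit shuffle, hash-distributing by a column:
spark.sql("SELECT * FROM sales DISTRIBUTE BY user_id")
```

Both approaches control parallelism at the query level without tying it to the table's bucketing layout, at the cost of an extra shuffle in the second case.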
