c21 commented on PR #36733: URL: https://github.com/apache/spark/pull/36733#issuecomment-1143188178
@manuzhang - from my understanding, you want to introduce a feature that enforces the number of Spark tasks to be the same as the number of table buckets, even when the query does not read the bucket column(s). I agree with @cloud-fan in https://github.com/apache/spark/pull/27924#issuecomment-1139340835 that controlling the number of Spark tasks should not be a design goal for bucketed tables. If you really want to control the number of tasks, you can either tune `spark.sql.files.maxPartitionBytes` or add an extra shuffle via `repartition()`/`DISTRIBUTE BY`. I understand your concern per https://github.com/apache/spark/pull/27924#issuecomment-1139360593, but I am afraid we would be introducing a feature here that is not actually needed by many other Spark users. To be honest, the requested feature does not seem popular based on my experience. My two cents: it might help to post on the Spark dev mailing list to gather more feedback on whether developers/users indeed have a similar requirement.
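For reference, the two alternatives mentioned above could look roughly like this. This is a sketch that assumes an existing SparkSession `spark`; the table name `sales` and column `user_id` are hypothetical:

```scala
// Option 1: lower the per-partition scan target so the file scan
// produces more (smaller) tasks. 64m is an illustrative value.
spark.conf.set("spark.sql.files.maxPartitionBytes", "64m")

// Option 2: add an explicit shuffle to pick the task count directly.
val df = spark.table("sales").repartition(200)

// SQL equivalent of the explicit shuffle, hash-distributing by a column:
spark.sql("SELECT * FROM sales DISTRIBUTE BY user_id")
```

Both approaches control parallelism at the query level without tying it to the table's bucketing layout, at the cost of an extra shuffle in the second case.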
