peter-toth commented on PR #53859:
URL: https://github.com/apache/spark/pull/53859#issuecomment-3778443598

   > Although we need to handle `copyFromTag` independently, the proposal 
itself sounds reasonable to me. Do you think you can share some supporting 
performance numbers based on the existing benchmark or from your production 
environment?
   
   Numbers depend heavily on the use case. In our case a customer would like to 
use SPJ between tables `A` and `B`. Both tables are storage partitioned, but `B` 
is storage partitioned by columns that don't match the join condition. In 
this case "one side shuffle" can help if 
`spark.sql.sources.v2.bucketing.shuffle.enabled` is enabled, so that only `B` is 
shuffled, but the unnecessary grouping of partitions still happens for `B`. 
And while storage partitioning of `B` helps in other queries, in this 
particular use case the grouping significantly reduces the number of partitions 
and slows down the stage before the shuffle.
   
   The optimization in this PR is similar to what 
`spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled` does to 
keep one side partially clustered, but it works with "one side shuffle".
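   
   For reference, a minimal sketch of a session configured for the "one side 
shuffle" scenario described above. The two `shuffle.enabled` and 
`partiallyClusteredDistribution.enabled` keys are the ones named in this 
comment; enabling `spark.sql.sources.v2.bucketing.enabled` alongside them, the 
application name, and the local master are assumptions made for illustration. 
The behavior added by this PR is not shown, since the comment does not name a 
config for it.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: session settings for storage-partitioned joins (SPJ)
// with "one side shuffle". App name and master are placeholders.
val spark = SparkSession.builder()
  .appName("spj-one-side-shuffle-sketch")
  .master("local[*]")
  // Assumption: SPJ itself has to be switched on for the settings below to matter.
  .config("spark.sql.sources.v2.bucketing.enabled", "true")
  // Named in the comment: allows shuffling only the side whose
  // partitioning does not match the join keys (table B here).
  .config("spark.sql.sources.v2.bucketing.shuffle.enabled", "true")
  .getOrCreate()

// Named in the comment as the existing, comparable optimization; it can be
// toggled at runtime to compare plans with and without it.
spark.conf.set(
  "spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled",
  "true")
```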
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

