peter-toth commented on PR #53859: URL: https://github.com/apache/spark/pull/53859#issuecomment-3778443598
> Although we need to handle `copyFromTag` independently, the proposal itself sounds reasonable to me. Do you think you can share some supporting performance numbers based on the existing benchmark or from your production environment?

Numbers depend heavily on the use case. In our case, a customer would like to use a storage-partitioned join (SPJ) between tables `A` and `B`. Both tables are storage partitioned, but `B` is storage partitioned by columns that don't match the join condition. In this case "one side shuffle" can help if `spark.sql.sources.v2.bucketing.shuffle.enabled` is enabled, so that only `B` is shuffled, but the unnecessary grouping of partitions still happens for `B`. And while the storage partitioning of `B` helps in other queries, in this particular use case it significantly reduces the number of partitions and slows down the stage before the shuffle.

The optimization in this PR is similar to what `spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled` does to keep one side partially clustered, but it works with "one side shuffle".
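
For readers who want to try the scenario, the following is a minimal configuration sketch and not something taken from the PR. It only turns on the settings named in the comment plus `spark.sql.sources.v2.bucketing.enabled`, which is assumed here as the base switch for SPJ. The application name, the local master, and the catalog and table names in the commented-out join are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a local session with the SPJ-related settings from the comment enabled.
val spark = SparkSession.builder()
  .appName("spj-one-side-shuffle-sketch") // hypothetical name
  .master("local[*]")                     // assumption: local run for experimentation
  // Assumed base switch for storage-partitioned joins.
  .config("spark.sql.sources.v2.bucketing.enabled", "true")
  // "One side shuffle": lets only the non-matching side (B in the comment) be shuffled.
  .config("spark.sql.sources.v2.bucketing.shuffle.enabled", "true")
  .getOrCreate()

// Hypothetical join; `cat.db.A` and `cat.db.B` stand in for DataSource V2 tables
// that report storage partitioning, with B partitioned by non-join columns.
// val joined = spark.table("cat.db.A").join(spark.table("cat.db.B"), Seq("id"))
// joined.explain() // inspect which side gets an exchange before the join
```

Comparing the `explain()` output and the task count of the stage scanning `B` with and without these settings is one way to see the effect described above, although the actual numbers will depend on the data source and the data layout.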
