Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080

> Both sides will satisfy the required distribution of the join

This is not true now. After this PR, join has a stricter distribution requirement called `HashPartitionedDistribution`, so range partitioning doesn't satisfy it and Spark will add a shuffle. I think this is reasonable: `ClusteredDistribution` cannot represent the co-partition requirement of join.

> Can you give a concrete example?

Let's take join as an example: `t1 join t2 on t1.a = t2.x and t1.b = t2.y`. The join requires `HashPartitionedDistribution(a, b)` for `t1`, and `HashPartitionedDistribution(x, y)` for `t2`. According to the definition of `HashPartitionedDistribution`, if `t1` has a tuple `(a=1, b=2)`, it will be in a certain partition, let's say the second partition of `t1`. If `t2` has a tuple `(x=1, y=2)`, it will also be in the second partition of `t2`, because Spark guarantees `t1` and `t2` have the same number of partitions, and `HashPartitionedDistribution` determines the partition given a tuple and numPartitions. So we can safely zip partitions of `t1` and `t2` and do the join.
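The argument above can be sketched with a toy example (plain Python, not Spark's actual partitioner or API): because a hash-partitioned distribution maps a tuple to a partition index using only the key columns and the number of partitions, two tables partitioned the same way with the same `numPartitions` can be zipped partition-by-partition and joined locally. The function and variable names below are illustrative, not from the PR.

```python
NUM_PARTITIONS = 4

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic: the same key and num_partitions always yield
    # the same partition index -- the core co-partitioning guarantee.
    return hash(key) % num_partitions

def hash_partition(rows, key_cols, num_partitions=NUM_PARTITIONS):
    # Bucket each row by the hash of its join-key columns.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        parts[partition_of(key, num_partitions)].append(row)
    return parts

def zip_join(parts1, key1, parts2, key2):
    # Safe only because both sides used the same partitioner and the
    # same number of partitions: matching keys land in the same index.
    out = []
    for p1, p2 in zip(parts1, parts2):
        for r1 in p1:
            for r2 in p2:
                if tuple(r1[c] for c in key1) == tuple(r2[c] for c in key2):
                    out.append({**r1, **r2})
    return out

# t1 join t2 on t1.a = t2.x and t1.b = t2.y
t1 = [{"a": 1, "b": 2, "v": "t1-row"}]
t2 = [{"x": 1, "y": 2, "w": "t2-row"}]

p1 = hash_partition(t1, ["a", "b"])
p2 = hash_partition(t2, ["x", "y"])
result = zip_join(p1, ["a", "b"], p2, ["x", "y"])
```

Range partitioning lacks this property: a range partitioner on `t1` and one on `t2` need not place equal keys at the same partition index, which is why it cannot satisfy the stricter requirement.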