Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080

> Both sides will satisfy the required distribution of the join

This is not true now. After this PR, join has a stricter distribution requirement called `HashPartitionedDistribution`, so range partitioning doesn't satisfy it and Spark will add a shuffle. I think this is reasonable: `ClusteredDistribution` cannot represent the co-partition requirement of join.

> Can you give a concrete example?

Let's take join as an example: `t1 join t2 on t1.a = t2.x and t1.b = t2.y`. The join requires `HashPartitionedDistribution(a, b)` for `t1`, and `HashPartitionedDistribution(x, y)` for `t2`. According to the definition of `HashPartitionedDistribution`, if `t1` has a tuple `(a=1, b=2)`, it will be in a certain partition, let's say the second partition of `t1`. If `t2` has a tuple `(x=1, y=2)`, it will also be in the second partition of `t2`, because Spark guarantees `t1` and `t2` have the same number of partitions, and `HashPartitionedDistribution` determines the partition given a tuple and numPartitions. So we can safely zip partitions of `t1` and `t2` and do the join.
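The argument above can be sketched with a toy example (plain Python, not Spark's actual partitioner or API): because a hash-partitioned distribution maps a tuple to a partition index using only the key columns and the number of partitions, two tables partitioned the same way with the same `numPartitions` can be zipped partition-by-partition and joined locally. The function and variable names below are illustrative, not from the PR.

```python
NUM_PARTITIONS = 4

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic: the same key and num_partitions always yield
    # the same partition index -- the core co-partitioning guarantee.
    return hash(key) % num_partitions

def hash_partition(rows, key_cols, num_partitions=NUM_PARTITIONS):
    # Bucket each row by the hash of its join-key columns.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        parts[partition_of(key, num_partitions)].append(row)
    return parts

def zip_join(parts1, key1, parts2, key2):
    # Safe only because both sides used the same partitioner and the
    # same number of partitions: matching keys land in the same index.
    out = []
    for p1, p2 in zip(parts1, parts2):
        for r1 in p1:
            for r2 in p2:
                if tuple(r1[c] for c in key1) == tuple(r2[c] for c in key2):
                    out.append({**r1, **r2})
    return out

# t1 join t2 on t1.a = t2.x and t1.b = t2.y
t1 = [{"a": 1, "b": 2, "v": "t1-row"}]
t2 = [{"x": 1, "y": 2, "w": "t2-row"}]

p1 = hash_partition(t1, ["a", "b"])
p2 = hash_partition(t2, ["x", "y"])
result = zip_join(p1, ["a", "b"], p2, ["x", "y"])
```

Range partitioning lacks this property: a range partitioner on `t1` and one on `t2` need not place equal keys at the same partition index, which is why it cannot satisfy the stricter requirement.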