Wenchen Fan created SPARK-12704:
-----------------------------------

             Summary: we may repartition a relation even when it's not needed
                 Key: SPARK-12704
                 URL: https://issues.apache.org/jira/browse/SPARK-12704
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Wenchen Fan


The implementation of {{HashPartitioning.compatibleWith}} has been wrong for a 
while. Consider the following case: {{table_a}} is hash partitioned by int 
column `i`, and {{table_b}} is also hash partitioned by int column `i`. 
Logically these 2 partitionings are compatible. However, 
{{HashPartitioning.compatibleWith}} returns false for this case, because the 
{{AttributeReference}}s of column `i` in the 2 tables have different 
expression IDs.
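A minimal sketch of the flawed comparison, using simplified stand-ins rather than Spark's actual classes (the names {{AttributeReference}} and {{compatibleWith}} come from Spark, but the fields and equality logic here are an illustrative assumption):

```scala
// Simplified stand-ins for Spark's classes, to illustrate the bug:
// an AttributeReference carries a per-plan expression ID, so the "same"
// column read from two different tables compares unequal.
case class AttributeReference(name: String, exprId: Long)

case class HashPartitioning(expressions: Seq[AttributeReference],
                            numPartitions: Int) {
  // Mirrors the flawed check: plain equality on the expression lists,
  // which includes their expression IDs.
  def compatibleWith(other: HashPartitioning): Boolean =
    numPartitions == other.numPartitions && expressions == other.expressions
}

// table_a.i and table_b.i: same name and type, different expression IDs.
val aI = AttributeReference("i", exprId = 1L)
val bI = AttributeReference("i", exprId = 2L)

val partA = HashPartitioning(Seq(aI), numPartitions = 200)
val partB = HashPartitioning(Seq(bI), numPartitions = 200)

// Logically compatible, but the check says no, which can trigger a shuffle.
println(partA.compatibleWith(partB)) // false
```

A semantic check would need to compare the partitioning expressions relative to each plan's own output ordering instead of by raw attribute equality.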

With this wrong result from {{HashPartitioning.compatibleWith}}, we go into 
[this 
branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390]
 and may add an unnecessary shuffle.

This won't impact correctness if the join keys are exactly the same as the hash 
partitioning keys, as there's still an opportunity to not partition that 
child in that branch: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428

However, if the join keys are a superset of the hash partitioning keys, for 
example, {{table_a}} and {{table_b}} are both hash partitioned by column `i` 
and we want to join them on columns `i, j`, logically we don't need a shuffle, 
but in fact the 2 tables, which start out partitioned only by `i`, are 
redundantly repartitioned by `i, j`.
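A toy illustration (not Spark code) of why partitioning by `i` alone already suffices for a join on `i, j`: two rows can only join if their `i` values are equal, and equal `i` values hash to the same partition, so repartitioning by `i, j` adds nothing.

```scala
// Toy hash partitioner over the single column `i`.
def partitionOf(i: Int, numPartitions: Int): Int =
  math.floorMod(i.hashCode, numPartitions)

val n = 200
// Two toy tables of (i, j) rows, each hash partitioned by `i` only.
val tableA = Seq((1, "a"), (2, "b"), (3, "c"))
val tableB = Seq((1, "a"), (3, "x"))

// Every pair of rows that would join on (i, j) necessarily agrees on `i`,
// so both rows already live in the same partition.
val joinablePartitions = for {
  (ai, aj) <- tableA
  (bi, bj) <- tableB
  if ai == bi && aj == bj
} yield (partitionOf(ai, n), partitionOf(bi, n))

println(joinablePartitions.forall { case (p1, p2) => p1 == p2 }) // true
```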



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
