Wenchen Fan created SPARK-12704:
-----------------------------------

             Summary: we may repartition a relation even when it's not needed
                 Key: SPARK-12704
                 URL: https://issues.apache.org/jira/browse/SPARK-12704
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Wenchen Fan
The implementation of {{HashPartitioning.compatibleWith}} has been wrong for a while. Consider the following case: if {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also hash partitioned by int column `i`, then logically these 2 partitionings are compatible. However, {{HashPartitioning.compatibleWith}} returns false for this case, because the {{AttributeReference}}s of column `i` in the 2 tables have different expr ids.

With this wrong result from {{HashPartitioning.compatibleWith}}, we go into [this branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390] and may add an unnecessary shuffle. This doesn't impact correctness when the join keys are exactly the same as the hash partitioning keys, as there is still an opportunity in that branch to avoid partitioning that child: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428

However, if the join keys are a superset of the hash partitioning keys, for example {{table_a}} and {{table_b}} are both hash partitioned by column `i` and we want to join them on columns `i, j`, then logically no shuffle is needed, but in fact the 2 tables, which start out partitioned only by `i`, are redundantly repartitioned by `i, j`.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
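To illustrate the false-negative described above, here is a minimal self-contained sketch. These are simplified stand-ins, not Spark's actual classes: {{AttributeReference}} is reduced to a name plus an expr id, and the buggy {{compatibleWith}} is modeled as structural equality (which includes the expr id), while a hypothetical semantic check compares only the column names and partition count.

```scala
// Simplified model (NOT Spark's real classes) of the compatibleWith bug.
case class AttributeReference(name: String, exprId: Long)

case class HashPartitioning(expressions: Seq[AttributeReference], numPartitions: Int) {
  // Models the buggy check: structural equality on the expressions,
  // which drags the per-plan exprId into the comparison.
  def compatibleWith(other: HashPartitioning): Boolean =
    this == other

  // Hypothetical semantic check: same partition count and the same
  // partitioning columns by name, ignoring expr ids.
  def semanticallyCompatibleWith(other: HashPartitioning): Boolean =
    numPartitions == other.numPartitions &&
      expressions.map(_.name) == other.expressions.map(_.name)
}

object CompatibleWithDemo extends App {
  // table_a.i and table_b.i: same int column `i`, but each table's
  // analyzed plan assigns its attribute a fresh expr id.
  val aI = AttributeReference("i", exprId = 1L)
  val bI = AttributeReference("i", exprId = 2L)

  val partA = HashPartitioning(Seq(aI), numPartitions = 200)
  val partB = HashPartitioning(Seq(bI), numPartitions = 200)

  // Buggy check reports the two partitionings as incompatible...
  assert(!partA.compatibleWith(partB))
  // ...even though they are logically compatible.
  assert(partA.semanticallyCompatibleWith(partB))
  println("demo ok")
}
```

Because the expr-id-laden comparison reports "incompatible", the planner falls into the shuffle-adding branch even though both children are already partitioned the same way.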