Tejas Patil created SPARK-17570:

             Summary: Avoid Hash and Exchange in Sort Merge join if bucketing 
factor is multiple for tables
                 Key: SPARK-17570
                 URL: https://issues.apache.org/jira/browse/SPARK-17570
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Tejas Patil
            Priority: Minor

For bucketed tables, Spark avoids `Sort` and `Exchange` when the input tables 
and the output table have the same number of buckets. However, unequal bucket 
counts always lead to `Sort` and `Exchange`. If the number of buckets in the 
output table is a factor of the number of buckets in each input table, we 
should be able to avoid `Sort` and `Exchange` and join the tables directly.
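
The condition above can be sketched as a small check. This is only an 
illustration in Python, not Spark code: the function name 
`can_skip_exchange` and its parameters are hypothetical, and it assumes the 
only requirement is that each input bucket count is a multiple of the output 
bucket count.

```python
def can_skip_exchange(input_bucket_counts, output_bucket_count):
    """Return True if every input bucket count is a positive multiple
    of the output bucket count (hypothetical helper, not a Spark API)."""
    if output_bucket_count <= 0:
        return False
    return all(
        n > 0 and n % output_bucket_count == 0
        for n in input_bucket_counts
    )
```

For example, inputs with 8 and 12 buckets qualify for an output with 4 
buckets, while inputs with 8 and 10 buckets do not.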

Assume Input1, Input2 and Output are bucketed + sorted tables over the same 
columns but with different numbers of buckets. Input1 has 8 buckets, Input2 has 
12 buckets and Output has 4 buckets. Since hash-partitioning is done using 
modulus, if we JOIN buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 in 
the same task, it would produce bucket 0 of the output table.

Input1   (0, 4)      (1, 5)      (2, 6)       (3, 7)
Input2   (0, 4, 8)   (1, 5, 9)   (2, 6, 10)   (3, 7, 11)
Output   (0)         (1)         (2)          (3)
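
With modulus-based bucketing, input bucket b feeds output bucket b mod 4, 
because (h mod N) mod M equals h mod M whenever M divides N. A minimal Python 
sketch of that grouping follows. The helper names are hypothetical, and plain 
integers stand in for row hash values; this is not how Spark implements it.

```python
from collections import defaultdict


def group_input_buckets(num_input_buckets, num_output_buckets):
    """Map each output bucket id to the input bucket ids that feed it."""
    if num_output_buckets <= 0 or num_input_buckets % num_output_buckets != 0:
        raise ValueError("output bucket count must divide input bucket count")
    groups = defaultdict(list)
    for bucket in range(num_input_buckets):
        # Input bucket b only holds rows whose output bucket is b mod M.
        groups[bucket % num_output_buckets].append(bucket)
    return dict(groups)


def buckets_agree(num_input_buckets, num_output_buckets, hashes):
    """Check that re-bucketing an input bucket id gives the same output
    bucket as bucketing the raw hash value directly."""
    return all(
        (h % num_input_buckets) % num_output_buckets == h % num_output_buckets
        for h in hashes
    )


input1_groups = group_input_buckets(8, 4)
input2_groups = group_input_buckets(12, 4)
```

Here `input1_groups[0]` is `[0, 4]` and `input2_groups[0]` is `[0, 4, 8]`, so 
a single task reading those buckets sees every row that belongs to output 
bucket 0.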
