GitHub user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/18692
@gatorsmile I took a look at the case above. Indeed, the proposed rule
triggers this issue, but only indirectly: in that example, the optimizer
never reaches a fixed point. Please find my investigation below.
```
...
// The new rule infers correct join predicates
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
:- Filter ((col1#32 = col2#33) && (col1#32 = 1))
: +- Relation[col1#32,col2#33] parquet
+- Filter (col#34 = 1)
+- Relation[col#34] parquet
// InferFiltersFromConstraints adds more filters
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
:- Filter ((((col2#33 = 1) && isnotnull(col1#32)) && isnotnull(col2#33)) && ((col1#32 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// ConstantPropagation is applied
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
!:- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// (Important) InferFiltersFromConstraints infers (col1#32 = col2#33), which is added to the join condition.
Join Inner, ((col1#32 = col2#33) && ((col2#33 = col#34) && (col1#32 = col#34)))
!:- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// PushPredicateThroughJoin pushes down (col1#32 = col2#33) and then CombineFilters produces
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
!:- Filter ((((isnotnull(col1#32) && (col2#33 = 1)) && isnotnull(col2#33)) && ((1 = col2#33) && (col1#32 = 1))) && (col2#33 = col1#32))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
```
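For context, the convergence test this trace keeps failing works roughly as follows (a simplified sketch of my own, not Spark's actual `RuleExecutor`): a batch stops only when one full pass over its rules leaves the plan unchanged, and the interplay above guarantees that never happens.
```
// Simplified sketch of a RuleExecutor-style fixed-point loop; the real
// implementation lives in org.apache.spark.sql.catalyst.rules.RuleExecutor.
def executeBatch[Plan](rules: Seq[Plan => Plan],
                       initial: Plan,
                       maxIterations: Int = 100): Plan = {
  var current = initial
  var iteration = 0
  var converged = false
  while (!converged && iteration < maxIterations) {
    // One pass applies every rule in the batch, in order.
    val next = rules.foldLeft(current)((plan, rule) => rule(plan))
    // Fixed point: the whole pass was a no-op.
    converged = next == current
    current = next
    iteration += 1
  }
  // With the plans above, the predicates keep being rewritten, removed,
  // and re-inferred, so the cap is always hit.
  if (!converged) println(s"Max iterations ($maxIterations) reached")
  current
}
```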
After the last step shown in the trace, `ConstantPropagation` replaces
`(col2#33 = col1#32)` with `(1 = 1)`, `BooleanSimplification` removes
`(1 = 1)`, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again,
and the procedure repeats forever. Since `InferFiltersFromConstraints` is the
last optimization rule, we end up with the redundant condition you mentioned.
Even without the new rule, the optimizer does not converge on the following
query:
```
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq(1, 2).toDF("col").write.saveAsTable("t2")
spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND
t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```
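To observe the non-convergence directly, one option (my own suggestion, not part of this PR) is to lower `spark.sql.optimizer.maxIterations` from its default of 100 and watch the optimizer log:
```
import org.apache.spark.sql.SparkSession

// Lower the fixed-point iteration cap so the warning shows up quickly
// (the default is 100).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.optimizer.maxIterations", "10")
  .getOrCreate()
import spark.implicits._

Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq(1, 2).toDF("col").write.saveAsTable("t2")
spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
// If the batch does not converge, RuleExecutor logs a message like:
//   Max iterations (10) reached for batch Operator Optimizations
```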
Correct me if I am wrong, but it seems like an issue with the existing
rules.