GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/18882
[SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints
## What changes were proposed in this pull request?
This PR adds code to filter out meaningless constraints inferred in
`inferAdditionalConstraints` (e.g., given the constraints `a = 1`, `b = 1`, `a = c`,
and `b = c`, it infers `a = b`, which is trivially true). These constraints can
add `Optimizer` overhead; for example:
```
scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
scala> Seq(1, 2).toDF("col").write.saveAsTable("t2")
scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```
In this query, `InferFiltersFromConstraints` infers a new constraint
`(col2#33 = col1#32)` and appends it to the join condition. `PushPredicateThroughJoin`
then pushes it down, `ConstantPropagation` replaces `(col2#33 = col1#32)` with
`(1 = 1)` based on the other propagated constraints, `ConstantFolding` replaces
`(1 = 1)` with `true`, and `BooleanSimplification` finally removes the predicate.
However, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again on the
next iteration, and the cycle repeats until the iteration limit is reached.
See the trace below for details:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
Before:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
```
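The gist of the proposed filtering can be expressed in the same toy model as above. This is an illustrative sketch of the idea only, not the actual patch: an inferred equality between two attributes is dropped when both attributes are already pinned to the same constant, so `a = b` is never re-added once `a = 1` and `b = 1` are known.

```scala
// Illustrative continuation of the toy model above (NOT the actual patch).
object FilterTrivialConstraintsSketch {
  import InferAdditionalConstraintsSketch._

  // Find a literal that `t` is constrained to equal, if any.
  private def constantFor(known: Set[EqualTo], t: Term): Option[Lit] =
    known.collectFirst {
      case EqualTo(`t`, l: Lit) => l
      case EqualTo(l: Lit, `t`) => l
    }

  // Keep only inferred constraints that are not trivially true.
  def filterTriviallyTrue(inferred: Set[EqualTo],
                          known: Set[EqualTo]): Set[EqualTo] =
    inferred.filterNot {
      case EqualTo(x: Attr, y: Attr) =>
        (constantFor(known, x), constantFor(known, y)) match {
          case (Some(c1), Some(c2)) => c1 == c2 // both equal the same literal
          case _                    => false
        }
      case _ => false // constraints involving literals are kept
    }
}
```

With the constraints from the earlier example, `filterTriviallyTrue` drops `a = b` (and `b = a`) but keeps the genuinely useful `c = 1`, which breaks the infer/fold/remove cycle shown in the trace.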
## How was this patch tested?
Added tests in `InferFiltersFromConstraintsSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark SPARK-21652
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18882.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18882
----
commit d253e40788b9e3408c106eff0ba84ae97d715cbb
Author: Takeshi Yamamuro <[email protected]>
Date: 2017-08-08T11:08:38Z
Should not infer the constraints that are trivially true
----