William1104 opened a new pull request #24912: [WIP] SPARK-28103 Fix constraints of a Union Logical Plan
URL: https://github.com/apache/spark/pull/24912

[Work in progress]

## What changes were proposed in this pull request?

Currently, the constraints of a Union can become empty if any child is optimized into an empty LocalRelation. The side effect is that filters can no longer be inferred correctly by `InferFiltersFromConstraints`. This PR updates how the constraints of a Union are calculated: a child is skipped if it is an empty LocalRelation.

We can reproduce the issue with the following setup:

1) Prepare two tables:

```
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
```

2) Create a union view on table1:

```
spark.sql("""
  | CREATE VIEW partitioned_table_1 AS
  | SELECT * FROM table1 WHERE id = 'a'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'b'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'c'
  | UNION ALL
  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
  """.stripMargin)
```

3) View the optimized plan of the query below. The filter `t2.id = 'a'` cannot be inferred on the right side of the join, and the constraints of the Union on the left side are empty:
```
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
```

After applying the fix, the constraints of the Union are no longer empty and the filter is inferred properly:

```
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a' ").queryExecution.optimizedPlan
res9: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#16)
:- Union
:  :- Filter (isnotnull(id#0) AND (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) AND NOT id#0 IN (a,b,c)) AND (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#16 = a) AND isnotnull(id#16))
   +- Relation[id#16,val#17] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a' ").queryExecution.optimizedPlan.children(0).constraints
res10: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), (id#0 = a))
```

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this.)

Please review https://spark.apache.org/contributing.html before opening a pull request.
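The idea behind the fix can be illustrated with a small self-contained sketch. This is plain Scala with hypothetical stand-ins, not Spark's actual Catalyst classes: `Child`, `constraintsBefore`, and `constraintsAfter` are illustrative names, and constraint expressions are modeled as strings rather than `ExpressionSet`. Intersecting constraint sets across all children lets a single empty LocalRelation (which carries no constraints) wipe out the Union's whole constraint set, while skipping empty children preserves the constraints common to the children that can actually produce rows.

```scala
// Hypothetical, simplified model of combining Union constraints.
object UnionConstraintsSketch {
  // A child plan reduced to the two facts this sketch needs: whether it is
  // an empty LocalRelation, and the constraints it guarantees on its output.
  final case class Child(isEmptyLocalRelation: Boolean, constraints: Set[String])

  // Before the fix: intersect the constraints of ALL children, so one empty
  // LocalRelation (with no constraints) empties the result.
  def constraintsBefore(children: Seq[Child]): Set[String] =
    children.map(_.constraints).reduce(_ intersect _)

  // After the fix: empty LocalRelations contribute no rows, so they are
  // skipped before intersecting the remaining children's constraints.
  def constraintsAfter(children: Seq[Child]): Set[String] = {
    val nonEmpty = children.filterNot(_.isEmptyLocalRelation).map(_.constraints)
    if (nonEmpty.isEmpty) Set.empty else nonEmpty.reduce(_ intersect _)
  }
}
```

With children mirroring the plan above (two filters plus two empty LocalRelations), `constraintsBefore` returns the empty set while `constraintsAfter` keeps the constraints shared by the surviving filters.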
