William1104 opened a new pull request #24912: [WIP] SPARK-28103 Fix constraints of a Union Logical Plan
URL: https://github.com/apache/spark/pull/24912

[Work in progress]

## What changes were proposed in this pull request?

Currently, the constraints of a Union can become empty if any child is optimized into an empty LocalRelation. The side effect is that filters can no longer be inferred correctly by `InferFiltersFromConstraints`. This PR updates how the constraints of a Union are calculated: a child is skipped if it is an empty LocalRelation.

We can reproduce the issue with the following setup:

1) Prepare two tables:

```
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
```

2) Create a union view on table1:

```
spark.sql("""
  | CREATE VIEW partitioned_table_1 AS
  | SELECT * FROM table1 WHERE id = 'a'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'b'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'c'
  | UNION ALL
  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
  """.stripMargin)
```

3) View the optimized plan of the query below. The filter `t2.id = 'a'` cannot be inferred on the right side of the join, and the constraints of the Union on the left side are empty:
```
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
```

After applying the fix, the constraints of the Union are no longer empty and the filter is inferred properly:

```
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a' ").queryExecution.optimizedPlan
res9: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#16)
:- Union
:  :- Filter (isnotnull(id#0) AND (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) AND NOT id#0 IN (a,b,c)) AND (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#16 = a) AND isnotnull(id#16))
   +- Relation[id#16,val#17] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a' ").queryExecution.optimizedPlan.children(0).constraints
res10: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), (id#0 = a))
```

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this.)

Please review https://spark.apache.org/contributing.html before opening a pull request.
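The idea behind the fix can be illustrated with a small self-contained sketch. This is plain Scala with hypothetical stand-ins, not Spark's actual Catalyst classes: `Child`, `constraintsBefore`, and `constraintsAfter` are illustrative names, and constraint expressions are modeled as strings rather than `ExpressionSet`. Intersecting constraint sets across all children lets a single empty LocalRelation (which carries no constraints) wipe out the Union's whole constraint set, while skipping empty children preserves the constraints common to the children that can actually produce rows.

```scala
// Hypothetical, simplified model of combining Union constraints.
object UnionConstraintsSketch {
  // A child plan reduced to the two facts this sketch needs: whether it is
  // an empty LocalRelation, and the constraints it guarantees on its output.
  final case class Child(isEmptyLocalRelation: Boolean, constraints: Set[String])

  // Before the fix: intersect the constraints of ALL children, so one empty
  // LocalRelation (with no constraints) empties the result.
  def constraintsBefore(children: Seq[Child]): Set[String] =
    children.map(_.constraints).reduce(_ intersect _)

  // After the fix: empty LocalRelations contribute no rows, so they are
  // skipped before intersecting the remaining children's constraints.
  def constraintsAfter(children: Seq[Child]): Set[String] = {
    val nonEmpty = children.filterNot(_.isEmptyLocalRelation).map(_.constraints)
    if (nonEmpty.isEmpty) Set.empty else nonEmpty.reduce(_ intersect _)
  }
}
```

With children mirroring the plan above (two filters plus two empty LocalRelations), `constraintsBefore` returns the empty set while `constraintsAfter` keeps the constraints shared by the surviving filters.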
