GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/18882
[SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints
## What changes were proposed in this pull request?
This PR adds code to filter out meaningless constraints inferred in
`inferAdditionalConstraints` (e.g., given the constraints `a = 1`, `b = 1`, `a = c`,
and `b = c`, it infers `a = b`, which is trivially true). These constraints can
add `Optimizer` overhead; for example:
```
scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
scala> Seq(1, 2).toDF("col").write.saveAsTable("t2")
scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```
In this query, `InferFiltersFromConstraints` infers a new constraint
`(col2#33 = col1#32)` and appends it to the join condition. `PushPredicateThroughJoin`
then pushes it down, `ConstantPropagation` replaces `(col2#33 = col1#32)` with
`(1 = 1)` based on the other propagated constraints, `ConstantFolding` replaces
`(1 = 1)` with `true`, and `BooleanSimplification` finally removes the predicate.
However, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again on the
next iteration, and the cycle repeats until the iteration limit is reached.
See the trace below for details:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
Before:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
```
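The gist of the proposed filtering can be expressed in the same toy model as above. This is an illustrative sketch of the idea only, not the actual patch: an inferred equality between two attributes is dropped when both attributes are already pinned to the same constant, so `a = b` is never re-added once `a = 1` and `b = 1` are known.

```scala
// Illustrative continuation of the toy model above (NOT the actual patch).
object FilterTrivialConstraintsSketch {
  import InferAdditionalConstraintsSketch._

  // Find a literal that `t` is constrained to equal, if any.
  private def constantFor(known: Set[EqualTo], t: Term): Option[Lit] =
    known.collectFirst {
      case EqualTo(`t`, l: Lit) => l
      case EqualTo(l: Lit, `t`) => l
    }

  // Keep only inferred constraints that are not trivially true.
  def filterTriviallyTrue(inferred: Set[EqualTo],
                          known: Set[EqualTo]): Set[EqualTo] =
    inferred.filterNot {
      case EqualTo(x: Attr, y: Attr) =>
        (constantFor(known, x), constantFor(known, y)) match {
          case (Some(c1), Some(c2)) => c1 == c2 // both equal the same literal
          case _                    => false
        }
      case _ => false // constraints involving literals are kept
    }
}
```

With the constraints from the earlier example, `filterTriviallyTrue` drops `a = b` (and `b = a`) but keeps the genuinely useful `c = 1`, which breaks the infer/fold/remove cycle shown in the trace.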
## How was this patch tested?
Added tests in `InferFiltersFromConstraintsSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark SPARK-21652
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18882.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18882
----
commit d253e40788b9e3408c106eff0ba84ae97d715cbb
Author: Takeshi Yamamuro <[email protected]>
Date: 2017-08-08T11:08:38Z
Should not infer the constraints that are trivially true
----