GitHub user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/18692
@gatorsmile I took a look at the case above. Indeed, the proposed rule
triggers this issue, but only indirectly: in that example, the optimizer
never reaches a fixed point. Please find my investigation below.
```
...
// The new rule infers correct join predicates
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
:- Filter ((col1#32 = col2#33) && (col1#32 = 1))
: +- Relation[col1#32,col2#33] parquet
+- Filter (col#34 = 1)
+- Relation[col#34] parquet
// InferFiltersFromConstraints adds more filters
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
:- Filter ((((col2#33 = 1) && isnotnull(col1#32)) && isnotnull(col2#33)) && ((col1#32 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// ConstantPropagation is applied
Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
!:- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// (Important) InferFiltersFromConstraints infers (col1#32 = col2#33), which is added to the join condition.
Join Inner, ((col1#32 = col2#33) && ((col2#33 = col#34) && (col1#32 = col#34)))
!:- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
// PushPredicateThroughJoin pushes down (col1#32 = col2#33) and then CombineFilters produces
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
!:- Filter ((((isnotnull(col1#32) && (col2#33 = 1)) && isnotnull(col2#33)) && ((1 = col2#33) && (col1#32 = 1))) && (col2#33 = col1#32))
: +- Relation[col1#32,col2#33] parquet
+- Filter (isnotnull(col#34) && (col#34 = 1))
+- Relation[col#34] parquet
```
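For context, the convergence test this trace keeps failing works roughly as follows (a simplified sketch of my own, not Spark's actual `RuleExecutor`): a batch stops only when one full pass over its rules leaves the plan unchanged, and the interplay above guarantees that never happens.
```
// Simplified sketch of a RuleExecutor-style fixed-point loop; the real
// implementation lives in org.apache.spark.sql.catalyst.rules.RuleExecutor.
def executeBatch[Plan](rules: Seq[Plan => Plan],
                       initial: Plan,
                       maxIterations: Int = 100): Plan = {
  var current = initial
  var iteration = 0
  var converged = false
  while (!converged && iteration < maxIterations) {
    // One pass applies every rule in the batch, in order.
    val next = rules.foldLeft(current)((plan, rule) => rule(plan))
    // Fixed point: the whole pass was a no-op.
    converged = next == current
    current = next
    iteration += 1
  }
  // With the plans above, the predicates keep being rewritten, removed,
  // and re-inferred, so the cap is always hit.
  if (!converged) println(s"Max iterations ($maxIterations) reached")
  current
}
```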
After the last step shown in the trace, `ConstantPropagation` replaces
`(col2#33 = col1#32)` with `(1 = 1)`, `BooleanSimplification` removes
`(1 = 1)`, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again,
and the procedure repeats forever. Since `InferFiltersFromConstraints` is the
last optimization rule, we end up with the redundant condition you mentioned.
Even without the new rule, the optimizer does not converge on the following
query:
```
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq(1, 2).toDF("col").write.saveAsTable("t2")
spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND
t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```
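To observe the non-convergence directly, one option (my own suggestion, not part of this PR) is to lower `spark.sql.optimizer.maxIterations` from its default of 100 and watch the optimizer log:
```
import org.apache.spark.sql.SparkSession

// Lower the fixed-point iteration cap so the warning shows up quickly
// (the default is 100).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.optimizer.maxIterations", "10")
  .getOrCreate()
import spark.implicits._

Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq(1, 2).toDF("col").write.saveAsTable("t2")
spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
// If the batch does not converge, RuleExecutor logs a message like:
//   Max iterations (10) reached for batch Operator Optimizations
```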
Correct me if I am wrong, but it seems like an issue with the existing
rules.