wangyum commented on a change in pull request #28642:
URL: https://github.com/apache/spark/pull/28642#discussion_r739098037
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1215,6 +1215,15 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan]
}
}
+ // Whether the result of this expression may be null. For example: CAST(strCol AS double)
+ // We will infer an IsNotNull expression for this expression to avoid skew join.
Review comment:
We can infer `IsNotNull(col)` already. For example:
```scala
spark.sql("create table t1 (id string, value int) using parquet")
spark.sql("create table t2 (id int, value int) using parquet")
spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("extended")
```
Before this PR:
```
== Optimized Logical Plan ==
Join Inner, (cast(id#0 as int) = id#2)
:- Filter isnotnull(id#0)
: +- Relation default.t1[id#0,value#1] parquet
+- Filter isnotnull(id#2)
+- Relation default.t2[id#2,value#3] parquet
```
After this PR:
```
== Optimized Logical Plan ==
Join Inner, (cast(id#0 as int) = id#2)
:- Filter (isnotnull(id#0) AND isnotnull(cast(id#0 as int)))
: +- Relation default.t1[id#0,value#1] parquet
+- Filter isnotnull(id#2)
+- Relation default.t2[id#2,value#3] parquet
```
Inferring `isnotnull(cast(t1.id as int))` may filter out many strings that cannot be cast to int.
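To illustrate why the extra predicate filters rows: with ANSI mode off, Spark's `cast(strCol as int)` returns NULL for strings that are not valid integers, so `isnotnull(cast(id as int))` drops those rows. A minimal plain-Scala sketch of that behavior (using `toIntOption` as a stand-in for the cast; this is an analogy, not Spark's actual cast implementation):

```scala
// Hypothetical sample ids; "abc" and "" have no int representation,
// so Spark's cast(id as int) would yield NULL for them.
val ids = Seq("1", "abc", "42", "")

// toIntOption returns None for unparseable strings, mirroring cast -> NULL.
val casted = ids.map(_.toIntOption)

// Rows that would survive an inferred isnotnull(cast(id as int)) filter.
val survivors = casted.flatten

println(survivors)  // prints List(1, 42)
```

So for a skewed column full of non-numeric strings, pushing this filter below the join can remove a large fraction of the rows before the shuffle.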
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]