Ji Jun Tang created SPARK-53742:
-----------------------------------
Summary: Push down the filter used in the count_if function
Key: SPARK-53742
URL: https://issues.apache.org/jira/browse/SPARK-53742
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.1
Reporter: Ji Jun Tang
Pushing the count_if predicate down as a Filter below the Aggregate would reduce
the volume of data that needs to be processed, because rows that cannot satisfy
the predicate would be dropped before aggregation.
{code:java}
spark.sql("create table t1(a int, b int, c int) using parquet")
spark.sql("select count_if(a <> 1) from t1").explain("cost")
{code}
Current:
{code:java}
== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_0#6) null else _common_expr_0#6) AS count_if((NOT (a = 1)))#4L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#0 = 1) AS _common_expr_0#6], Statistics(sizeInBytes=1.0 B)
   +- Relation spark_catalog.default.t1[a#0,b#1,c#2] parquet, Statistics(sizeInBytes=0.0 B)
{code}
Expected:
{code:java}
== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_2#22) null else _common_expr_2#22) AS count_if((NOT (a = 1)))#21L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#3 = 1) AS _common_expr_2#22], Statistics(sizeInBytes=1.0 B)
   +- Filter (isnotnull(a#3) AND NOT (a#3 = 1)), Statistics(sizeInBytes=1.0 B)
      +- Relation spark_catalog.default.t1[a#3,b#4,c#5] parquet, Statistics(sizeInBytes=0.0 B)
{code}
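
To illustrate why the Filter in the expected plan should not change the result, here is a minimal sketch in plain Python (no Spark required) that models SQL NULL semantics with None. The helper names sql_not_equal, count_if and count_after_filter are hypothetical and exist only for this illustration; they are not Spark APIs. The sketch assumes the shape of the repro query: a global aggregate with no GROUP BY whose only aggregate function is count_if.

```python
from typing import Optional, Sequence


def sql_not_equal(a: Optional[int], b: int) -> Optional[bool]:
    """Model SQL `a <> b`: the result is NULL (None) when a is NULL."""
    return None if a is None else a != b


def count_if(values: Sequence[Optional[int]], literal: int) -> int:
    """Model count_if(a <> literal): scan every row, count only TRUE results."""
    return sum(1 for a in values if sql_not_equal(a, literal) is True)


def count_after_filter(values: Sequence[Optional[int]], literal: int) -> int:
    """Model the expected plan: apply isnotnull(a) AND NOT (a = literal) first,
    then count the rows that survive the filter."""
    kept = [a for a in values if a is not None and not (a == literal)]
    return len(kept)


# Hypothetical sample data for column a, including NULLs.
rows = [1, 2, None, 3, 1, None]

# Both strategies agree, including on empty input, where a global count is 0.
print(count_if(rows, 1), count_after_filter(rows, 1))
print(count_if([], 1), count_after_filter([], 1))
```

Rows where the predicate evaluates to false or NULL contribute nothing to count_if, so dropping them before aggregation leaves the count unchanged. The equivalence relies on the stated assumption: if the same query also computed another aggregate such as count(*), or grouped by a column, the filter could remove rows or whole groups that the other results depend on.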