Ji Jun Tang created SPARK-53742:
-----------------------------------

             Summary: Push down the filter used in the count_if function
                 Key: SPARK-53742
                 URL: https://issues.apache.org/jira/browse/SPARK-53742
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.1
            Reporter: Ji Jun Tang


Pushing the predicate of the count_if function down into a Filter below the 
Aggregate would reduce the volume of data that needs to be processed.

 
{code:java}
spark.sql("create table t1(a int, b int, c int) using parquet")
spark.sql("select count_if(a <> 1) from t1").explain("cost")
{code}
Current:
{code:java}
== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_0#6) null else _common_expr_0#6) AS count_if((NOT (a = 1)))#4L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#0 = 1) AS _common_expr_0#6], Statistics(sizeInBytes=1.0 B)
   +- Relation spark_catalog.default.t1[a#0,b#1,c#2] parquet, Statistics(sizeInBytes=0.0 B)
{code}
Expected:
{code:java}
== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_2#22) null else _common_expr_2#22) AS count_if((NOT (a = 1)))#21L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#3 = 1) AS _common_expr_2#22], Statistics(sizeInBytes=1.0 B)
   +- Filter (isnotnull(a#3) AND NOT (a#3 = 1)), Statistics(sizeInBytes=1.0 B)
      +- Relation spark_catalog.default.t1[a#3,b#4,c#5] parquet, Statistics(sizeInBytes=0.0 B)
{code}
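
For comparison, the pushdown can be approximated by hand by moving the predicate into a WHERE clause. The snippet below is only a sketch, not the proposed optimizer rule: it assumes a spark-shell session in which `spark` is already defined and the table t1 from the reproduction above exists. The names viaCountIf and viaFilter are illustrative.
{code:scala}
// Assumption: run in spark-shell, with table t1 created as in the reproduction.

// Query from the ticket: the predicate is evaluated inside the aggregate.
val viaCountIf = spark.sql("select count_if(a <> 1) from t1")

// Hand-written variant: the predicate is applied as a filter before counting.
val viaFilter = spark.sql("select count(*) from t1 where a <> 1")

// Inspect the plan of the hand-written variant for comparison.
viaFilter.explain("cost")

// Both queries are global aggregates returning a single bigint row,
// so the two counts should agree (rows where a is NULL count in neither).
assert(viaCountIf.head().getLong(0) == viaFilter.head().getLong(0))
{code}
This equivalence relies on count_if being the only aggregate in a query without GROUP BY. With grouping, a filter would drop groups that have no matching rows, and with additional aggregates it would change their input, so a general rule would presumably have to check for those cases.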
 



