caican00 commented on code in PR #37479:
URL: https://github.com/apache/spark/pull/37479#discussion_r951334595


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2549,6 +2551,47 @@ class DataSourceV2SQLSuite
     }
   }
 
+  test("SPARK-40045: Move the post-Scan Filters to the far right") {
+    val t1 = s"${catalogAndNamespace}table"
+    withTable(t1) {
+      spark.udf.register("udfStrLen", (str: String) => str.length)
+      sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
+      sql(s"INSERT INTO $t1 VALUES (1, 'a'), (2, 'b'), (3, 'c')")
+
+      val filterBefore = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |and id =2
+           |""".stripMargin
+      )
+      val filtersBefore = 
find(filterBefore.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersBefore.length == 5
+        && filtersBefore(3).trim.startsWith("(udfStrLen(data")
+        && filtersBefore(4).trim.startsWith("(trim(data"))
+
+      val filterAfter = spark.sql(
+        s"""
+           |SELECT id, data FROM $t1
+           |where id =2
+           |and udfStrLen(data) = 1
+           |and trim(data) = 'a'
+           |""".stripMargin
+      )
+      val filtersAfter = 
find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
+        .head.asInstanceOf[FilterExec]
+        .condition.toString
+        .split("AND")
+      assert(filtersAfter.length == 5
+        && filtersAfter(3).trim.startsWith("(udfStrLen(data")

Review Comment:
   > wait, shouldn't this udf be on the far right?
   
   @cloud-fan  In the following SQL:
   ```
   SELECT id, data FROM testcat.ns1.ns2.table
   where udfStrLen(data) = 1
   and trim(data) = 'a'
   and id =2
   ```
   the `udfStrLen` and `trim` functions are untranslatable (they cannot be pushed down to the data source), so they already appear on the far right relative to the translatable predicate `id =2`.
   
   ```
   == Physical Plan ==
   *(1) Project [id#24L, data#25]
   +- *(1) Filter (((isnotnull(id#24L) AND (id#24L = 2)) AND 
(udfStrLen(data#25) = 1)) AND (trim(data#25, None) = a))
      +- BatchScan[id#24L, data#25] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to