Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20265#discussion_r161425712
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala ---
@@ -483,6 +484,64 @@ object OrcReadBenchmark {
}
}
+ def filterPushDownBenchmark(values: Int, width: Int): Unit = {
+ val benchmark = new Benchmark(s"Filter Pushdown", values)
+
+ withTempPath { dir =>
+ withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
+ import spark.implicits._
+        val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+        val whereExpr = (1 to width).map(i => s"NOT c$i LIKE '%not%exist%'").mkString(" AND ")
--- End diff ---
Ur, @cloud-fan and @gatorsmile .
The best case for PPD is **Spark needs to do lots of processing on the
returned rows, but the ORC reader only returns one stripe with minimal CPU cost**.
So, I designed this benchmark to show the difference clearly.
1. The pushed-down predicate is only `uniqueID = 0` (minimal). We can
change that into `uniqueID ==` or `uniqueID >`.
2. The `LIKE` predicate is chosen because it's not pushed down and makes Spark
do more processing. It's just one example of that kind of operation. You
can ignore those predicates. We could choose some UDFs instead.
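To make the generated expressions concrete, here is a minimal standalone sketch (plain Scala, no Spark required) of the `selectExpr`/`whereExpr` strings the diff above builds. `width = 3` is a hypothetical small value chosen purely for illustration:

```scala
// Sketch: how filterPushDownBenchmark builds its SQL strings.
// `width` is a hypothetical value here; the benchmark varies it.
val width = 3

// One projected column per index: "CAST(value AS STRING) c1", ...
val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")

// One non-pushable LIKE predicate per column, AND-chained together,
// so Spark must evaluate them all on every returned row.
val whereExpr = (1 to width).map(i => s"NOT c$i LIKE '%not%exist%'").mkString(" AND ")

println(selectExpr.mkString(", "))
println(whereExpr)
```

Because none of the `LIKE` predicates can be pushed to the ORC reader, widening `width` scales up Spark-side CPU work while the pushed-down `uniqueID = 0` filter keeps the reader-side work minimal.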
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]