Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20265#discussion_r161425712
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala ---
@@ -483,6 +484,64 @@ object OrcReadBenchmark {
}
}
+ def filterPushDownBenchmark(values: Int, width: Int): Unit = {
+ val benchmark = new Benchmark(s"Filter Pushdown", values)
+
+ withTempPath { dir =>
+ withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
+ import spark.implicits._
+        val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+        val whereExpr = (1 to width).map(i => s"NOT c$i LIKE '%not%exist%'").mkString(" AND ")
--- End diff ---
Ur, @cloud-fan and @gatorsmile .
The best case for PPD is **Spark needs to do lots of processing on the
returned rows, but the ORC reader only returns one stripe with minimal CPU cost**.
So, I designed this benchmark to show the difference clearly.
1. The pushed-down predicate is only `uniqueID = 0` (minimal). We can
change that into `uniqueID ==` or `uniqueID >`.
2. The `LIKE` predicate is chosen because it's not pushed down and makes Spark
do more processing. It's just one example of that kind of operation. You
can ignore those predicates. We could choose some UDFs instead.
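To make the generated expressions concrete, here is a minimal standalone sketch (plain Scala, no Spark required) of the `selectExpr`/`whereExpr` strings the diff above builds. `width = 3` is a hypothetical small value chosen purely for illustration:

```scala
// Sketch: how filterPushDownBenchmark builds its SQL strings.
// `width` is a hypothetical value here; the benchmark varies it.
val width = 3

// One projected column per index: "CAST(value AS STRING) c1", ...
val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")

// One non-pushable LIKE predicate per column, AND-chained together,
// so Spark must evaluate them all on every returned row.
val whereExpr = (1 to width).map(i => s"NOT c$i LIKE '%not%exist%'").mkString(" AND ")

println(selectExpr.mkString(", "))
println(whereExpr)
```

Because none of the `LIKE` predicates can be pushed to the ORC reader, widening `width` scales up Spark-side CPU work while the pushed-down `uniqueID = 0` filter keeps the reader-side work minimal.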
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]