GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/21623

    [SPARK-24638][SQL] StringStartsWith support push down

    ## What changes were proposed in this pull request?
    
    Support push down of `StringStartsWith` filters for Parquet. In the performance test below, this saves about 50% of the compute time.
    
    ## How was this patch tested?
    Unit tests and manual tests.
    Performance test:
    ```scala
    cat <<'EOF' > SPARK-24638.scala
    spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string")
    val df = spark.read.parquet("/tmp/spark/parquet/string/")
    spark.sql("set spark.sql.parquet.filterPushdown=true")
    val pushdownEnableStart = System.currentTimeMillis()
    for(i <- 0 until 100) {
      df.where("id like '999998%'").count()
    }
    val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart
    
    spark.sql("set spark.sql.parquet.filterPushdown=false")
    val pushdownDisableStart = System.currentTimeMillis()
    for(i <- 0 until 100) {
      df.where("id like '999998%'").count()
    }
    val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart
    
    val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble
    
    println(s"improvements: ${improvements} ms")
    
    EOF
    
    bin/spark-shell -i SPARK-24638.scala
    ```

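The description above does not spell out how a prefix filter can skip data, so here is a minimal sketch of one common approach: use the min/max statistics of each row group to decide whether any value in it could start with the prefix. This is an illustration under stated assumptions, not the code in this patch. `RowGroupStats`, `StartsWithPruning`, and the sample statistics are hypothetical and are not Spark or Parquet APIs. The sketch also assumes plain lexicographic `String` ordering, which matches byte-wise ordering only for ASCII data.

```scala
// Hypothetical stand-in for the min/max statistics of one row group.
final case class RowGroupStats(min: String, max: String)

object StartsWithPruning {
  /** True when no value in [stats.min, stats.max] can start with `prefix`.
    *
    * Truncating to the prefix length preserves lexicographic order, so any
    * value v that starts with `prefix` must satisfy
    * min.take(n) <= prefix <= max.take(n).
    */
  def canDrop(stats: RowGroupStats, prefix: String): Boolean = {
    val n = prefix.length
    stats.max.take(n) < prefix || stats.min.take(n) > prefix
  }

  def main(args: Array[String]): Unit = {
    // Made-up statistics for two row groups (illustrative only).
    val groups = Seq(
      RowGroupStats("0str0", "4999str4999"),
      RowGroupStats("5000str5000", "9999str9999")
    )
    val prefix = "9998"
    // Keep only the row groups that might contain a matching value.
    val kept = groups.filterNot(canDrop(_, prefix))
    println(s"row groups to scan: ${kept.size} of ${groups.size}")
  }
}
```

With the sample statistics, only the second row group has to be scanned. The check is conservative: a row group that is kept may still contain no match, so the filter must still be evaluated on the rows that are read.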

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24638

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21623
    
----
commit 5b52ace44c8a41631535c883b7a5c8545959e5e5
Author: Yuming Wang <yumwang@...>
Date:   2018-06-23T13:27:30Z

    StringStartsWith support push down

----


