GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21623
[SPARK-24638][SQL] StringStartsWith support push down

## What changes were proposed in this pull request?

Add Parquet filter pushdown support for `StringStartsWith`. This saves about 50% of compute time.

## How was this patch tested?

Unit tests and manual tests.

Performance test:

```scala
cat <<'EOF' > SPARK-24638.scala
// The heredoc delimiter is quoted so the shell leaves ${improvements} untouched.
// Write 10,000,000 strings such as "42str42" into a single file with 1 MiB row groups.
spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string")
val df = spark.read.parquet("/tmp/spark/parquet/string/")

// Time 100 prefix queries with Parquet filter pushdown enabled.
spark.sql("set spark.sql.parquet.filterPushdown=true")
val pushdownEnableStart = System.currentTimeMillis()
for(i <- 0 until 100) {
  df.where("id like '999998%'").count()
}
val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart

// Time the same 100 queries with pushdown disabled.
spark.sql("set spark.sql.parquet.filterPushdown=false")
val pushdownDisableStart = System.currentTimeMillis()
for(i <- 0 until 100) {
  df.where("id like '999998%'").count()
}
val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart

// Time saved by pushdown, in milliseconds.
val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble
println(s"improvements: ${improvements}")
EOF

bin/spark-shell -i SPARK-24638.scala
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24638

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21623

----
commit 5b52ace44c8a41631535c883b7a5c8545959e5e5
Author: Yuming Wang <yumwang@...>
Date:   2018-06-23T13:27:30Z

    StringStartsWith support push down

----
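
The description above reports the saving but not where it comes from. As a rough illustration only, and not the code in this patch: a prefix filter can skip a whole row group when the group's min/max column statistics show that no value can start with the prefix. The sketch below assumes such statistics are available; `StringStats`, `StartsWithPushdownSketch` and `canDrop` are hypothetical names, and `String.compareTo` stands in for Parquet's binary ordering.

```scala
// Hypothetical stand-in for one row group's column statistics; not a Parquet or Spark class.
final case class StringStats(min: String, max: String)

object StartsWithPushdownSketch {
  // Returns true when no value in [stats.min, stats.max] can start with `prefix`,
  // so the row group could be skipped without being read.
  def canDrop(stats: StringStats, prefix: String): Boolean = {
    // Truncating to the prefix length preserves lexicographic order (non-strictly),
    // so any matching value v satisfies minPrefix <= prefix <= maxPrefix.
    val minPrefix = stats.min.take(prefix.length)
    val maxPrefix = stats.max.take(prefix.length)
    maxPrefix.compareTo(prefix) < 0 || minPrefix.compareTo(prefix) > 0
  }
}
```

For the query `id like '999998%'`, a row group whose values range from `100str100` to `199str199` would be dropped, while one ranging from `999000str999000` to `999999str999999` must still be read. The check is conservative: it never drops a group that might contain a match, but a group that is kept may still contain none.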