GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/21623
[SPARK-24638][SQL] StringStartsWith support push down
## What changes were proposed in this pull request?
Support pushing `StringStartsWith` filters down to the Parquet data source. This saves about 50% of compute time.
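For reviewers unfamiliar with how a prefix filter can skip data, here is a minimal sketch of the idea, not the code in this patch. It assumes each row group exposes min/max statistics for the string column and that plain `String` ordering matches the ordering used for those statistics (true for ASCII data such as the test values). `StartsWithPruning` and `canDrop` are hypothetical names used only for illustration.

```scala
// Hypothetical helper, not Spark's or Parquet's API.
object StartsWithPruning {
  // Returns true when no value in [min, max] can start with `prefix`,
  // so a row group with these statistics can be skipped for `col LIKE 'prefix%'`.
  def canDrop(prefix: String, min: String, max: String): Boolean = {
    val n = prefix.length
    // Truncating to the prefix length preserves lexicographic order, so:
    // - if the truncated max sorts before the prefix, every value is too small;
    // - if the truncated min sorts after the prefix, every value is too large.
    max.take(n) < prefix || min.take(n) > prefix
  }
}
```

With the test data below, a row group spanning `100str100` to `200str200` can be dropped for the prefix `999998`, while one spanning `999990str999990` to `999999str999999` has to be read.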
## How was this patch tested?
Unit tests and manual tests.
Performance test:
```scala
# Quote the heredoc delimiter so the shell does not expand ${improvements}.
cat <<'EOF' > SPARK-24638.scala
// Write 10M strings to a single Parquet file with 1 MB row groups.
spark.range(10000000).selectExpr("concat(id, 'str', id) as id").
  coalesce(1).write.option("parquet.block.size", 1048576).
  parquet("/tmp/spark/parquet/string")
val df = spark.read.parquet("/tmp/spark/parquet/string/")

// Run the prefix query 100 times with filter push down enabled.
spark.sql("set spark.sql.parquet.filterPushdown=true")
val pushdownEnableStart = System.currentTimeMillis()
for (i <- 0 until 100) {
  df.where("id like '999998%'").count()
}
val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart

// Run the same query 100 times with filter push down disabled.
spark.sql("set spark.sql.parquet.filterPushdown=false")
val pushdownDisableStart = System.currentTimeMillis()
for (i <- 0 until 100) {
  df.where("id like '999998%'").count()
}
val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart

// Difference in elapsed time, in milliseconds.
val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble
println(s"improvements: ${improvements}")
EOF
bin/spark-shell -i SPARK-24638.scala
```
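The script prints the absolute difference in milliseconds. To compare it with the percentage quoted in the description, a small helper along these lines could be pasted into the same shell session. `relativeSaving` is a hypothetical name, not part of the patch.

```scala
// Hypothetical helper: percentage of elapsed time saved when push down is enabled.
def relativeSaving(disabledMs: Long, enabledMs: Long): Double =
  (disabledMs - enabledMs).toDouble / disabledMs * 100.0
```

For example, `println(f"saving: ${relativeSaving(pushdownDisable, pushdownEnable)}%.1f%%")` reports the saving relative to the run with push down disabled.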
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-24638
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21623.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21623
----
commit 5b52ace44c8a41631535c883b7a5c8545959e5e5
Author: Yuming Wang <yumwang@...>
Date: 2018-06-23T13:27:30Z
StringStartsWith support push down
----