GitHub user viirya reopened a pull request:
https://github.com/apache/spark/pull/14847
[SPARK-17254][SQL] Filter can stop when the condition is false if the child
output is sorted
## What changes were proposed in this pull request?
From
https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
Filter on sorted data
If the data is sorted by a key, filters on the key could stop as soon as
the data is out of range. For example, WHERE ticker_id < âFâ should stop as
soon as the first row starting with âFâ is seen. This can be done adding a
Filter operator that has âstop if falseâ semantics. This is generally
useful.
### Benchmark
val N = 500L << 18
val df = sparkSession.range(N).sort("id").persist()
runBenchmark("range/sort/filter/sum", N) {
df.filter("id < 10").groupBy().sum().collect()
}
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux
3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sort/filter/sum: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
range/sort/filter/sum wholestage off 573 / 665 228.6
4.4 1.0X
range/sort/filter/sum wholestage on 273 / 363 479.6
2.1 2.1X
After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux
3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sort/filter/sum: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
range/sort/filter/sum wholestage off 325 / 424 403.7
2.5 1.0X
range/sort/filter/sum wholestage on 254 / 289 516.2
1.9 1.3X
This patch can improve the wholestage off case significantly. For the
wholestage on case, looks like no significant difference. It might due to that
in wholestage on case we use `continue` to skip the filtering, but it still
needs to iterate all the rows.
## How was this patch tested?
Jenkins tests. Manually test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 filter-stop-early
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14847.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14847
----
commit ccac04f1e788dfb278618a10ee9220c89df6a61d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-08-26T12:49:54Z
Filter can stop when the condition is false if the child output is sorted.
commit fd365cde14a6a94c33418522d300dca43a94ad92
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-08-26T12:49:54Z
Add StopAfter physical plan.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]