GitHub user viirya reopened a pull request:
https://github.com/apache/spark/pull/14847
[SPARK-17254][SQL] Add StopAfter physical plan for the filtering that can
be stopped early
## What changes were proposed in this pull request?
This is motivated by:
From
https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
For some data which are sorted such as bucketed table with SORT BY, the
filtering on it can be stopped early if the filtering condition satisfies some
requirements. For example, the filtering like âWHERE id < 10â can be
stopped as soon as the first row on which the id column is with values equal to
or more than 10 is seen. I.e., once the condition is false, the filtering can
be stopped.
This pr adds a new `StopAfterExec` physical plan to achieve this.
Design document:
https://issues.apache.org/jira/secure/attachment/12833704/stop-after-physical-plan.pdf
### Benchmark
val N = 500L << 18
val df = sparkSession.range(N).sort("id").persist()
runBenchmark("range/sort/filter/sum", N) {
df.filter("id < 10").groupBy().sum().collect()
}
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.20-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sort/filter/sum: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
range/sort/filter/sum wholestage off 266 / 389 493.4
2.0 1.0X
range/sort/filter/sum wholestage on 170 / 208 771.9
1.3 1.6X
After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.20-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sort/filter/sum: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
range/sort/filter/sum wholestage off 141 / 207 928.4
1.1 1.0X
range/sort/filter/sum wholestage on 49 / 67 2694.1
0.4 2.9X
## How was this patch tested?
Jenkins tests. Manually test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 filter-stop-early
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14847.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14847
----
commit ccac04f1e788dfb278618a10ee9220c89df6a61d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-08-26T12:49:54Z
Filter can stop when the condition is false if the child output is sorted.
commit fd365cde14a6a94c33418522d300dca43a94ad92
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-08-26T12:49:54Z
Add StopAfter physical plan.
commit 293474a6a9229630e4e31e51f07c7b8821228f41
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-10-14T05:36:00Z
rename StopAfter to StopAfterExec.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]