GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/2188
[SPARK-2961][SQL] Use statistics to prune batches within cached partitions
This PR is based on #1883 authored by @marmbrus. Key differences:
1. Batch pruning instead of partition pruning
When #1883 was authored, batched column buffer building (#1880) hadn't
been introduced. This PR combines these two and provide partition batch level
pruning, which leads to smaller memory footprints and can generally skip more
elements. The cost is that the pruning predicates are evaluated more frequently
(partition number multiplies batch number per partition).
1. More filters are supported
Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their
conjunctions are supported. Disjunctions are not supported yet. In short,
```sql
SELECT * FROM t WHERE i > 0 AND i < 10
```
is executed in a more smart way, while
```sql
SELECT * FROM t WHERE i < 0 OR i > 10
```
still scans the whole table.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark in-mem-batch-pruning
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2188.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2188
----
commit 1a31cd519b999848f0b76c94a9e258b17d977cc6
Author: Cheng Lian <[email protected]>
Date: 2014-08-16T09:49:34Z
Merge branch 'inMemStats' into in-mem-batch-pruning
Tried to combine Michael's partition pruning branch and the batched
column buffer building. In this way, we actually got "batch pruning"
rather than partition pruning.
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
commit 520f41a0618a9529dfa86e2f748db405a4802f4f
Author: Cheng Lian <[email protected]>
Date: 2014-08-18T02:44:38Z
Added more predicate filters, fixed table scan stats for testing purposes
commit 77a2203ceb18546ddee2373af29036971bebd572
Author: Cheng Lian <[email protected]>
Date: 2014-08-28T23:00:44Z
Code cleanup, bugfix, and adding tests
* Bugfix: gatherStats() should be called in NullableColumnBuilder,
otherwise null values are skipped
* Bugfix: fixed lower bound comparison in StringColumnStats and
TimestampColumnStats
commit 46890319002be21061bcb1931ea9ccb43063fcf0
Author: Cheng Lian <[email protected]>
Date: 2014-08-29T00:17:21Z
More test cases
commit 04a956d6ed415b0ecfa717f7067231026ba25414
Author: Cheng Lian <[email protected]>
Date: 2014-08-29T00:18:02Z
Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
commit ef057747fa3d9f85615c5650873abc374e21e16d
Author: Cheng Lian <[email protected]>
Date: 2014-08-29T00:20:20Z
Minor code cleanup
commit b4a82810b3567a9a7ba085750f6f03103ed83ae7
Author: Cheng Lian <[email protected]>
Date: 2014-08-29T00:22:32Z
Minor code cleanup
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]