[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

liancheng Thu, 28 Aug 2014 17:38:47 -0700

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/2188


    [SPARK-2961][SQL] Use statistics to prune batches within cached partitions

    This PR is based on #1883 authored by @marmbrus. Key differences:
    
    1. Batch pruning instead of partition pruning
    
       When #1883 was authored, batched column buffer building (#1880) hadn't 
been introduced. This PR combines these two and provide partition batch level 
pruning, which leads to smaller memory footprints and can generally skip more 
elements. The cost is that the pruning predicates are evaluated more frequently 
(partition number multiplies batch number per partition).
    
    1. More filters are supported
    
       Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their 
conjunctions are supported. Disjunctions are not supported yet. In short, 
    
       ```sql
       SELECT * FROM t WHERE i > 0 AND i < 10
       ```
    
       is executed in a more smart way, while
    
       ```sql
       SELECT * FROM t WHERE i < 0 OR i > 10
       ```
    
       still scans the whole table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark in-mem-batch-pruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2188
    
----
commit 1a31cd519b999848f0b76c94a9e258b17d977cc6
Author: Cheng Lian <[email protected]>
Date:   2014-08-16T09:49:34Z

    Merge branch 'inMemStats' into in-mem-batch-pruning
    
    Tried to combine Michael's partition pruning branch and the batched
    column buffer building. In this way, we actually got "batch pruning"
    rather than partition pruning.
    
    Conflicts:
        
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
    
    Conflicts:
        
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

commit 520f41a0618a9529dfa86e2f748db405a4802f4f
Author: Cheng Lian <[email protected]>
Date:   2014-08-18T02:44:38Z

    Added more predicate filters, fixed table scan stats for testing purposes

commit 77a2203ceb18546ddee2373af29036971bebd572
Author: Cheng Lian <[email protected]>
Date:   2014-08-28T23:00:44Z

    Code cleanup, bugfix, and adding tests
    
    * Bugfix: gatherStats() should be called in NullableColumnBuilder,
      otherwise null values are skipped
    * Bugfix: fixed lower bound comparison in StringColumnStats and
      TimestampColumnStats

commit 46890319002be21061bcb1931ea9ccb43063fcf0
Author: Cheng Lian <[email protected]>
Date:   2014-08-29T00:17:21Z

    More test cases

commit 04a956d6ed415b0ecfa717f7067231026ba25414
Author: Cheng Lian <[email protected]>
Date:   2014-08-29T00:18:02Z

    Renamed PartitionSkippingSuite to PartitionBatchPruningSuite

commit ef057747fa3d9f85615c5650873abc374e21e16d
Author: Cheng Lian <[email protected]>
Date:   2014-08-29T00:20:20Z

    Minor code cleanup

commit b4a82810b3567a9a7ba085750f6f03103ed83ae7
Author: Cheng Lian <[email protected]>
Date:   2014-08-29T00:22:32Z

    Minor code cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Reply via email to