[GitHub] spark pull request #19810: Partition level pruning 2

CodingCat Thu, 23 Nov 2017 23:43:39 -0800

GitHub user CodingCat opened a pull request:

    https://github.com/apache/spark/pull/19810


    Partition level pruning 2

    ## What changes were proposed in this pull request?
    
    In the current implementation of Spark, InMemoryTableExec read all data in 
a cached table, filter CachedBatch according to stats and pass data to the 
downstream operators. This implementation makes it inefficient to reside the 
whole table in memory to serve various queries against different partitions of 
the table, which occupies a certain portion of our users' scenarios.
    
    The following is an example of such a use case:
    
    store_sales is a 1TB-sized table in cloud storage, which is partitioned by 
'location'. The first query, Q1, wants to output several metrics A, B, C for 
all stores in all locations. After that, a small team of 3 data scientists 
wants to do some causal analysis for the sales in different locations. To avoid 
unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole 
table in memory in Q1.
    
    With the current implementation, even any one of the data scientists is 
only interested in one out of three locations, the queries they submit to Spark 
cluster is still reading 1TB data completely.
    
    The reason behind the extra reading operation is that we implement 
CachedBatch as
    
    case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: 
InternalRow)
    where the stats is a part of every CachedBatch, so we can only filter 
batches for output of InMemoryTableExec operator by reading all data in 
in-memory table as input. The extra reading would be even more unacceptable 
when some of the table's data is evicted to disks.
    
    We propose to introduce a new type of block, metadata block, for the 
partitions of RDD representing data in the cached table. Every metadata block 
contains stats info for all columns in a partition and is saved to BlockManager 
when executing compute() method for the partition. To minimize the number of 
bytes to read,
    
    ## How was this patch tested?
    
    (TBD, post it soon)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark partition_level_pruning_2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19810
    
----
commit c3f1a9bb4b53ceef8ad3fb31b3164665549f8bc9
Author: CodingCat <[email protected]>
Date:   2016-03-07T14:37:37Z

    improve the doc for "spark.memory.offHeap.size"

commit 6a2b3ca56d26f5fb03d165466e9b6edadeb0adac
Author: CodingCat <[email protected]>
Date:   2016-03-07T19:00:16Z

    fix

commit 6e37fa2d06107c175cede4ff1b1a65dd14de7d5a
Author: CodingCat <[email protected]>
Date:   2017-10-27T17:36:08Z

    add configuration for partition_metadata

commit aa7066038e23cea7c5eb720aa70cd1a84d6f751f
Author: CodingCat <[email protected]>
Date:   2017-10-27T19:49:33Z

    framework of CachedColumnarRDD

commit d1380821d4a9cbb45f5b28c775939e0d130f0922
Author: CodingCat <[email protected]>
Date:   2017-10-27T22:53:38Z

    code framework

commit a72d7798b17000250eda0bcc8ae726b7dce7aa0c
Author: CodingCat <[email protected]>
Date:   2017-10-30T20:41:18Z

    remove cachedcolumnarbatchRDD

commit 0fe35f82fcd9175b409429c03c9f7b33712df8ae
Author: CodingCat <[email protected]>
Date:   2017-10-30T21:12:24Z

    fix styly error

commit 9e342432a82cbe325f338bd787be6427002981e1
Author: CodingCat <[email protected]>
Date:   2017-10-31T00:01:53Z

    temp

commit 677ca81709fa34b3cad13f5843b8f408e401476b
Author: CodingCat <[email protected]>
Date:   2017-11-01T23:01:49Z

    'CachedColumnarRDD'

commit df1d79620f7c8073b6bc1a119a245c3d5413ec71
Author: CodingCat <[email protected]>
Date:   2017-11-01T23:11:52Z

    change types

commit 08fd0857024a192307e66b6c3cffb19ae879000b
Author: CodingCat <[email protected]>
Date:   2017-11-01T23:16:58Z

    fix compilation error

commit d4fc2b772b60161a24892133bdae2ff28233250a
Author: CodingCat <[email protected]>
Date:   2017-11-02T17:13:15Z

    update

commit 97a63d6d1c1cd81bb97f0b998e716b13f5bd92d9
Author: CodingCat <[email protected]>
Date:   2017-11-02T17:21:23Z

    fix storage level

commit a24b7bbfa6c393974d553ba703777934bed85ec1
Author: CodingCat <[email protected]>
Date:   2017-11-02T17:42:25Z

    fix getOrCompute

commit 0e8e6395df97b4adeabb5a32ad441b96e5c19eb9
Author: CodingCat <[email protected]>
Date:   2017-11-02T17:55:36Z

    evaluate with partition metadata

commit b89d58b26650ce9b63713a8a2371e280986720bb
Author: CodingCat <[email protected]>
Date:   2017-11-02T18:21:19Z

    fix getOrCompute

commit 3f2eae73cefd5676c799a6fd1e384556ee6c33a8
Author: CodingCat <[email protected]>
Date:   2017-11-02T19:06:49Z

    add logging

commit 507c1a22da7de8b5bd24c066ec61a1c5f1c604dd
Author: CodingCat <[email protected]>
Date:   2017-11-02T19:12:57Z

    add logging for skipped partition

commit 40d441cc6985aa995e5c7fe1387744381425e6ef
Author: CodingCat <[email protected]>
Date:   2017-11-02T20:15:29Z

    try to print stats

commit 520e5aab3a06b9c1599af3984d4d2d028ced2ad6
Author: CodingCat <[email protected]>
Date:   2017-11-02T20:20:43Z

    add logging for skipped partition

commit 885808fc0c0f78d0ce7bbb5a5d06222bb06bf2cb
Author: CodingCat <[email protected]>
Date:   2017-11-02T20:25:18Z

    add logging for skipped partition

commit 37b5971524c4bb7f42b23a2ea013564ce3f6015c
Author: CodingCat <[email protected]>
Date:   2017-11-02T21:14:57Z

    add logging for skipped partition

commit 4dbfe37d6d719e7755bb6c9160262201908d36f4
Author: CodingCat <[email protected]>
Date:   2017-11-09T21:23:33Z

    refactor the code

commit 6165838371bbe345f6bc9d462671ffb3f32cbb94
Author: CodingCat <[email protected]>
Date:   2017-11-09T21:30:57Z

    fix compilation issue

commit 05f226772e26ab81b764632a5d979eca4e4deffd
Author: CodingCat <[email protected]>
Date:   2017-11-09T21:33:51Z

    refactor the code

commit bcafe822a879bf48982f6d9f3255cbc7a5e8236c
Author: CodingCat <[email protected]>
Date:   2017-11-09T21:49:14Z

    test

commit 5b888d303a2b68fd36c293286e8bf6e67c06802c
Author: CodingCat <[email protected]>
Date:   2017-11-09T21:50:11Z

    fix compilation issue

commit 977b93f4b67ff5e57f1d0081b5aead70a90ac13f
Author: CodingCat <[email protected]>
Date:   2017-11-09T22:11:04Z

    add missing filtering

commit 9c9bcadf6f7f2ae90393f1ab26db595a0a836c99
Author: CodingCat <[email protected]>
Date:   2017-11-09T22:16:43Z

    test

commit 56a430722fc7b76b3f2365eee4d94e6ab62380a6
Author: CodingCat <[email protected]>
Date:   2017-11-09T22:22:23Z

    test

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19810: Partition level pruning 2

Reply via email to