GitHub user CodingCat opened a pull request:
https://github.com/apache/spark/pull/19810
Partition level pruning 2
## What changes were proposed in this pull request?
In the current implementation of Spark, InMemoryTableExec read all data in
a cached table, filter CachedBatch according to stats and pass data to the
downstream operators. This implementation makes it inefficient to reside the
whole table in memory to serve various queries against different partitions of
the table, which occupies a certain portion of our users' scenarios.
The following is an example of such a use case:
store_sales is a 1TB-sized table in cloud storage, which is partitioned by
'location'. The first query, Q1, wants to output several metrics A, B, C for
all stores in all locations. After that, a small team of 3 data scientists
wants to do some causal analysis for the sales in different locations. To avoid
unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole
table in memory in Q1.
With the current implementation, even any one of the data scientists is
only interested in one out of three locations, the queries they submit to Spark
cluster is still reading 1TB data completely.
The reason behind the extra reading operation is that we implement
CachedBatch as
case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats:
InternalRow)
where the stats is a part of every CachedBatch, so we can only filter
batches for output of InMemoryTableExec operator by reading all data in
in-memory table as input. The extra reading would be even more unacceptable
when some of the table's data is evicted to disks.
We propose to introduce a new type of block, metadata block, for the
partitions of RDD representing data in the cached table. Every metadata block
contains stats info for all columns in a partition and is saved to BlockManager
when executing compute() method for the partition. To minimize the number of
bytes to read,
## How was this patch tested?
(TBD, post it soon)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/CodingCat/spark partition_level_pruning_2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19810.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19810
----
commit c3f1a9bb4b53ceef8ad3fb31b3164665549f8bc9
Author: CodingCat <[email protected]>
Date: 2016-03-07T14:37:37Z
improve the doc for "spark.memory.offHeap.size"
commit 6a2b3ca56d26f5fb03d165466e9b6edadeb0adac
Author: CodingCat <[email protected]>
Date: 2016-03-07T19:00:16Z
fix
commit 6e37fa2d06107c175cede4ff1b1a65dd14de7d5a
Author: CodingCat <[email protected]>
Date: 2017-10-27T17:36:08Z
add configuration for partition_metadata
commit aa7066038e23cea7c5eb720aa70cd1a84d6f751f
Author: CodingCat <[email protected]>
Date: 2017-10-27T19:49:33Z
framework of CachedColumnarRDD
commit d1380821d4a9cbb45f5b28c775939e0d130f0922
Author: CodingCat <[email protected]>
Date: 2017-10-27T22:53:38Z
code framework
commit a72d7798b17000250eda0bcc8ae726b7dce7aa0c
Author: CodingCat <[email protected]>
Date: 2017-10-30T20:41:18Z
remove cachedcolumnarbatchRDD
commit 0fe35f82fcd9175b409429c03c9f7b33712df8ae
Author: CodingCat <[email protected]>
Date: 2017-10-30T21:12:24Z
fix styly error
commit 9e342432a82cbe325f338bd787be6427002981e1
Author: CodingCat <[email protected]>
Date: 2017-10-31T00:01:53Z
temp
commit 677ca81709fa34b3cad13f5843b8f408e401476b
Author: CodingCat <[email protected]>
Date: 2017-11-01T23:01:49Z
'CachedColumnarRDD'
commit df1d79620f7c8073b6bc1a119a245c3d5413ec71
Author: CodingCat <[email protected]>
Date: 2017-11-01T23:11:52Z
change types
commit 08fd0857024a192307e66b6c3cffb19ae879000b
Author: CodingCat <[email protected]>
Date: 2017-11-01T23:16:58Z
fix compilation error
commit d4fc2b772b60161a24892133bdae2ff28233250a
Author: CodingCat <[email protected]>
Date: 2017-11-02T17:13:15Z
update
commit 97a63d6d1c1cd81bb97f0b998e716b13f5bd92d9
Author: CodingCat <[email protected]>
Date: 2017-11-02T17:21:23Z
fix storage level
commit a24b7bbfa6c393974d553ba703777934bed85ec1
Author: CodingCat <[email protected]>
Date: 2017-11-02T17:42:25Z
fix getOrCompute
commit 0e8e6395df97b4adeabb5a32ad441b96e5c19eb9
Author: CodingCat <[email protected]>
Date: 2017-11-02T17:55:36Z
evaluate with partition metadata
commit b89d58b26650ce9b63713a8a2371e280986720bb
Author: CodingCat <[email protected]>
Date: 2017-11-02T18:21:19Z
fix getOrCompute
commit 3f2eae73cefd5676c799a6fd1e384556ee6c33a8
Author: CodingCat <[email protected]>
Date: 2017-11-02T19:06:49Z
add logging
commit 507c1a22da7de8b5bd24c066ec61a1c5f1c604dd
Author: CodingCat <[email protected]>
Date: 2017-11-02T19:12:57Z
add logging for skipped partition
commit 40d441cc6985aa995e5c7fe1387744381425e6ef
Author: CodingCat <[email protected]>
Date: 2017-11-02T20:15:29Z
try to print stats
commit 520e5aab3a06b9c1599af3984d4d2d028ced2ad6
Author: CodingCat <[email protected]>
Date: 2017-11-02T20:20:43Z
add logging for skipped partition
commit 885808fc0c0f78d0ce7bbb5a5d06222bb06bf2cb
Author: CodingCat <[email protected]>
Date: 2017-11-02T20:25:18Z
add logging for skipped partition
commit 37b5971524c4bb7f42b23a2ea013564ce3f6015c
Author: CodingCat <[email protected]>
Date: 2017-11-02T21:14:57Z
add logging for skipped partition
commit 4dbfe37d6d719e7755bb6c9160262201908d36f4
Author: CodingCat <[email protected]>
Date: 2017-11-09T21:23:33Z
refactor the code
commit 6165838371bbe345f6bc9d462671ffb3f32cbb94
Author: CodingCat <[email protected]>
Date: 2017-11-09T21:30:57Z
fix compilation issue
commit 05f226772e26ab81b764632a5d979eca4e4deffd
Author: CodingCat <[email protected]>
Date: 2017-11-09T21:33:51Z
refactor the code
commit bcafe822a879bf48982f6d9f3255cbc7a5e8236c
Author: CodingCat <[email protected]>
Date: 2017-11-09T21:49:14Z
test
commit 5b888d303a2b68fd36c293286e8bf6e67c06802c
Author: CodingCat <[email protected]>
Date: 2017-11-09T21:50:11Z
fix compilation issue
commit 977b93f4b67ff5e57f1d0081b5aead70a90ac13f
Author: CodingCat <[email protected]>
Date: 2017-11-09T22:11:04Z
add missing filtering
commit 9c9bcadf6f7f2ae90393f1ab26db595a0a836c99
Author: CodingCat <[email protected]>
Date: 2017-11-09T22:16:43Z
test
commit 56a430722fc7b76b3f2365eee4d94e6ab62380a6
Author: CodingCat <[email protected]>
Date: 2017-11-09T22:22:23Z
test
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]