GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/13899
[SPARK-16196][SQL] Codegen caching + store rows as ColumnarBatches
## What changes were proposed in this pull request?
This patch makes `InMemoryRelation` faster by generating code to store the
input rows as `ColumnarBatches`. This code path is enabled by default but only
supports primitive types, falling back to the old, slower code path if there
are unsupported types (e.g. strings, arrays, UDTs) in the schema.
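For illustration, here is roughly what such a schema check could look like (a minimal sketch; `supportsColumnarCaching` and the exact type list are assumptions, not the code in this patch):
```scala
import org.apache.spark.sql.types._

// Hypothetical check: the codegen path handles only fixed-width primitive
// columns; anything else falls back to the ColumnBuilder-based path.
def supportsColumnarCaching(schema: StructType): Boolean =
  schema.fields.forall { field =>
    field.dataType match {
      case BooleanType | ByteType | ShortType | IntegerType |
           LongType | FloatType | DoubleType => true
      case _ => false // strings, arrays, UDTs, etc.
    }
  }
```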
The old code path reads the input rows into `ColumnBuilder`s, which is slow
because these builders are backed by `ByteBuffer`s and each value incurs
several virtual function calls, even more so when compression is enabled.
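To see why, here is a simplified contrast between the two write paths (stand-in classes, not the real `ColumnBuilder`/`ColumnVector` implementations):
```scala
import java.nio.ByteBuffer

// Old path (schematically): every value goes through a virtual call on a
// ByteBuffer-backed builder, with bounds checks on each put.
trait LongBuilder { def append(v: Long): Unit }
class ByteBufferLongBuilder extends LongBuilder {
  private val buf = ByteBuffer.allocate(1 << 20)
  override def append(v: Long): Unit = buf.putLong(v)
}

// New path (schematically): the generated code reduces to a direct store into
// a primitive array, which the JIT can inline and optimize.
class LongColumnVector(capacity: Int) {
  private val data = new Array[Long](capacity)
  private var n = 0
  def putLong(v: Long): Unit = { data(n) = v; n += 1 }
}
```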
The following numbers measure the read path (i.e. reading cached batches from
memory). The first row is the uncached baseline; the second and third rows show
caching performance before this patch, and the last row shows caching
performance after this patch.
```
Cache random keys:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------
cache = F                                           890 /  920         47.1          21.2       1.0X
cache = T columnar_batches = F compress = F        1950 / 1978         21.5          46.5       0.5X
cache = T columnar_batches = F compress = T        1893 / 1927         22.2          45.1       0.5X
cache = T columnar_batches = T                      540 /  544         77.7          12.9       1.6X
```
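For context, this kind of measurement can be approximated with plain Spark APIs as follows (illustrative only; the `CacheBenchmark` added by this patch does proper warm-up and repeated iterations):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("cache-bench").getOrCreate()
import spark.implicits._

// 20M random keys, cached and materialized up front so that only the
// read path (scanning cached batches) is timed below.
val df = spark.range(20000000L).map(_ => scala.util.Random.nextLong()).toDF("key")
df.cache().count()

val start = System.nanoTime()
df.selectExpr("sum(key)").collect()
println(s"read path took ${(System.nanoTime() - start) / 1e6} ms")
```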
## How was this patch tested?
`CacheBenchmark`, `InMemoryColumnarQuerySuite`, existing tests
## Generated code
(Will be posted shortly)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark speedup-cache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13899.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13899
----
commit be1ae40a6a1c1097909006570f7ce0fa42097128
Author: Andrew Or <[email protected]>
Date: 2016-06-17T20:44:18Z
Move it
commit bf11d278cb6420c10d4f748e2b19cead4a3f6391
Author: Andrew Or <[email protected]>
Date: 2016-06-17T21:47:20Z
Add benchmark code
commit 82499c37f8a2a539e97febc05f3f416411dc0985
Author: Andrew Or <[email protected]>
Date: 2016-06-20T18:35:21Z
backup
commit 2f12e96f3d23d49587e15364861dbe34bdfc8972
Author: Andrew Or <[email protected]>
Date: 2016-06-20T22:36:44Z
Narrow benchmarked code + add back old scan code
commit 6da1e71be250fd4ddfe5cbca076ede3b78d67d0e
Author: Andrew Or <[email protected]>
Date: 2016-06-21T18:28:19Z
Fix benchmark to time only the read path
commit fdf321e3c6d9c057193620bfba8fbc97a01e8513
Author: Andrew Or <[email protected]>
Date: 2016-06-21T21:35:01Z
First working impl. of ColumnarBatch based caching
Note, this doesn't work: spark.table("tab1").collect(), because
we're trying to cast ColumnarBatch.Row into UnsafeRow. This works,
however: spark.table("tab1").groupBy("i").sum("j").collect().
commit d0d2661f47d351dab0627fde44e192e144e661a6
Author: Andrew Or <[email protected]>
Date: 2016-06-22T00:45:34Z
Always enable codegen and vectorized hashmap
commit 570d0c3470bfcd095c4a0389cd05c1a2c764bd25
Author: Andrew Or <[email protected]>
Date: 2016-06-22T19:48:37Z
Don't benchmark aggregate
commit 3e96f4efbe17a1f7f6047d937379401daa6f252c
Author: Andrew Or <[email protected]>
Date: 2016-06-22T21:57:56Z
Codegen memory scan using ColumnarBatches
commit 5726d11adb202136f827133ce8f9a3ab595a17f0
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:10:41Z
Clean up the code a little
commit d255eb02f0188da630f17e8d1af711297cf03e7d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:15:45Z
Merge branch 'master' of github.com:apache/spark into speedup-cache
commit f4f81826b5facb83e1ab6cd0988d056feedc5d54
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:19:42Z
Clean up a little more
commit 41d52b75fa39d09adb40a792cef4e2ffe2e0851f
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:10:41Z
Generate code for write path to support other types
Previously we could only support schemas where all columns are
Longs because we hardcoded putLong and getLong calls in the write
path. This led to unfathomable NPEs when trying to cache something
with other types.
This commit fixes this by generalizing the code to build column
batches.
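Presumably the generalized write path emits a per-type put/get call for each column; a hedged sketch of that kind of codegen dispatch (names and signatures are approximations, not the patch's actual code):
```scala
import org.apache.spark.sql.types._

// Produce the Java source fragment that copies one field from an input row
// into a column vector; colVar, rowVar, and rowIdVar are variable names
// in the generated code.
def genPutCall(dt: DataType, colVar: String, rowVar: String,
               rowIdVar: String, ordinal: Int): String = dt match {
  case IntegerType => s"$colVar.putInt($rowIdVar, $rowVar.getInt($ordinal));"
  case LongType    => s"$colVar.putLong($rowIdVar, $rowVar.getLong($ordinal));"
  case DoubleType  => s"$colVar.putDouble($rowIdVar, $rowVar.getDouble($ordinal));"
  case other => throw new UnsupportedOperationException(s"unsupported type: $other")
}
```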
commit b6618d77e924dd49c9dcd2e31bbb24a3d8fa5d14
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:15:29Z
Merge branch 'master' of github.com:apache/spark into speedup-cache
commit 06bbfdbf040e509b88e8462c80bb566e0ac314c8
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:01:44Z
Move cache benchmark to new file
commit 1a12d06e4e3f71cd21229d9adc766d5643dfdfa3
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:43:13Z
Abstract codegen code into ColumnarBatchScan
commit 8cdbdd0c729936d731e531ee10c2ba4e72ceec57
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:04:09Z
Introduce CACHE_CODEGEN config to reduce dup code
commit faa6776b92a8ca5281699df3af1f1fc59aa786e8
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:34:57Z
Add some tests for InMemoryRelation
commit 2ba6b1e2f79a1c41b56e51a7d3a01b06417f03dd
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:40:11Z
Add some tests for InMemoryRelation
commit 7f09753a5df4465d1e4f0d57d06b53b4637f7470
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:44:21Z
Fix InMemoryColumnarQuerySuite
commit c72c085b32179113e546fb0251032e95106b2cd3
Author: Andrew Or <[email protected]>
Date: 2016-06-24T19:00:37Z
Clean up code: abstract CachedBatch and ColumnarBatch
----