GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/13899
[SPARK-16196][SQL] Codegen caching + store rows as ColumnarBatches
## What changes were proposed in this pull request?
This patch makes `InMemoryRelation` faster by generating code to store the
input rows as `ColumnarBatches`. This code path is enabled by default but only
supports primitive types, falling back to the old, slower code path if there
are unsupported types (e.g. strings, arrays, UDTs) in the schema.
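For illustration, here is roughly what such a schema check could look like (a minimal sketch; `supportsColumnarCaching` and the exact type list are assumptions, not the code in this patch):
```scala
import org.apache.spark.sql.types._

// Hypothetical check: the codegen path handles only fixed-width primitive
// columns; anything else falls back to the ColumnBuilder-based path.
def supportsColumnarCaching(schema: StructType): Boolean =
  schema.fields.forall { field =>
    field.dataType match {
      case BooleanType | ByteType | ShortType | IntegerType |
           LongType | FloatType | DoubleType => true
      case _ => false // strings, arrays, UDTs, etc.
    }
  }
```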
The old code path reads the input rows into `ColumnBuilder`s, which is slow
because these builders are backed by `ByteBuffer`s and each value incurs
several virtual function calls, even more so when compression is enabled.
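To see why, here is a simplified contrast between the two write paths (stand-in classes, not the real `ColumnBuilder`/`ColumnVector` implementations):
```scala
import java.nio.ByteBuffer

// Old path (schematically): every value goes through a virtual call on a
// ByteBuffer-backed builder, with bounds checks on each put.
trait LongBuilder { def append(v: Long): Unit }
class ByteBufferLongBuilder extends LongBuilder {
  private val buf = ByteBuffer.allocate(1 << 20)
  override def append(v: Long): Unit = buf.putLong(v)
}

// New path (schematically): the generated code reduces to a direct store into
// a primitive array, which the JIT can inline and optimize.
class LongColumnVector(capacity: Int) {
  private val data = new Array[Long](capacity)
  private var n = 0
  def putLong(v: Long): Unit = { data(n) = v; n += 1 }
}
```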
The following numbers measure the read path (i.e. reading cached batches from
memory). The first row is the uncached baseline; the second and third rows show
caching performance before this patch, and the last row shows caching
performance after this patch.
```
Cache random keys:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------
cache = F                                           890 /  920         47.1          21.2       1.0X
cache = T columnar_batches = F compress = F        1950 / 1978         21.5          46.5       0.5X
cache = T columnar_batches = F compress = T        1893 / 1927         22.2          45.1       0.5X
cache = T columnar_batches = T                      540 /  544         77.7          12.9       1.6X
```
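For context, this kind of measurement can be approximated with plain Spark APIs as follows (illustrative only; the `CacheBenchmark` added by this patch does proper warm-up and repeated iterations):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("cache-bench").getOrCreate()
import spark.implicits._

// 20M random keys, cached and materialized up front so that only the
// read path (scanning cached batches) is timed below.
val df = spark.range(20000000L).map(_ => scala.util.Random.nextLong()).toDF("key")
df.cache().count()

val start = System.nanoTime()
df.selectExpr("sum(key)").collect()
println(s"read path took ${(System.nanoTime() - start) / 1e6} ms")
```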
## How was this patch tested?
`CacheBenchmark`, `InMemoryColumnarQuerySuite`, existing tests
## Generated code
(Will be posted shortly)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark speedup-cache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13899.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13899
----
commit be1ae40a6a1c1097909006570f7ce0fa42097128
Author: Andrew Or <[email protected]>
Date: 2016-06-17T20:44:18Z
Move it
commit bf11d278cb6420c10d4f748e2b19cead4a3f6391
Author: Andrew Or <[email protected]>
Date: 2016-06-17T21:47:20Z
Add benchmark code
commit 82499c37f8a2a539e97febc05f3f416411dc0985
Author: Andrew Or <[email protected]>
Date: 2016-06-20T18:35:21Z
backup
commit 2f12e96f3d23d49587e15364861dbe34bdfc8972
Author: Andrew Or <[email protected]>
Date: 2016-06-20T22:36:44Z
Narrow benchmarked code + add back old scan code
commit 6da1e71be250fd4ddfe5cbca076ede3b78d67d0e
Author: Andrew Or <[email protected]>
Date: 2016-06-21T18:28:19Z
Fix benchmark to time only the read path
commit fdf321e3c6d9c057193620bfba8fbc97a01e8513
Author: Andrew Or <[email protected]>
Date: 2016-06-21T21:35:01Z
First working impl. of ColumnarBatch based caching
Note, this doesn't work: spark.table("tab1").collect(), because
we're trying to cast ColumnarBatch.Row into UnsafeRow. This works,
however: spark.table("tab1").groupBy("i").sum("j").collect().
commit d0d2661f47d351dab0627fde44e192e144e661a6
Author: Andrew Or <[email protected]>
Date: 2016-06-22T00:45:34Z
Always enable codegen and vectorized hashmap
commit 570d0c3470bfcd095c4a0389cd05c1a2c764bd25
Author: Andrew Or <[email protected]>
Date: 2016-06-22T19:48:37Z
Don't benchmark aggregate
commit 3e96f4efbe17a1f7f6047d937379401daa6f252c
Author: Andrew Or <[email protected]>
Date: 2016-06-22T21:57:56Z
Codegen memory scan using ColumnarBatches
commit 5726d11adb202136f827133ce8f9a3ab595a17f0
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:10:41Z
Clean up the code a little
commit d255eb02f0188da630f17e8d1af711297cf03e7d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:15:45Z
Merge branch 'master' of github.com:apache/spark into speedup-cache
commit f4f81826b5facb83e1ab6cd0988d056feedc5d54
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:19:42Z
Clean up a little more
commit 41d52b75fa39d09adb40a792cef4e2ffe2e0851f
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:10:41Z
Generate code for write path to support other types
Previously we could only support schemas where all columns are
Longs because we hardcoded putLong and getLong calls in the write
path. This led to unfathomable NPEs when trying to cache something
with other types.
This commit fixes this by generalizing the code to build column
batches.
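Presumably the generalized write path emits a per-type put/get call for each column; a hedged sketch of that kind of codegen dispatch (names and signatures are approximations, not the patch's actual code):
```scala
import org.apache.spark.sql.types._

// Produce the Java source fragment that copies one field from an input row
// into a column vector; colVar, rowVar, and rowIdVar are variable names
// in the generated code.
def genPutCall(dt: DataType, colVar: String, rowVar: String,
               rowIdVar: String, ordinal: Int): String = dt match {
  case IntegerType => s"$colVar.putInt($rowIdVar, $rowVar.getInt($ordinal));"
  case LongType    => s"$colVar.putLong($rowIdVar, $rowVar.getLong($ordinal));"
  case DoubleType  => s"$colVar.putDouble($rowIdVar, $rowVar.getDouble($ordinal));"
  case other => throw new UnsupportedOperationException(s"unsupported type: $other")
}
```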
commit b6618d77e924dd49c9dcd2e31bbb24a3d8fa5d14
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:15:29Z
Merge branch 'master' of github.com:apache/spark into speedup-cache
commit 06bbfdbf040e509b88e8462c80bb566e0ac314c8
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:01:44Z
Move cache benchmark to new file
commit 1a12d06e4e3f71cd21229d9adc766d5643dfdfa3
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:43:13Z
Abstract codegen code into ColumnarBatchScan
commit 8cdbdd0c729936d731e531ee10c2ba4e72ceec57
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:04:09Z
Introduce CACHE_CODEGEN config to reduce dup code
commit faa6776b92a8ca5281699df3af1f1fc59aa786e8
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:34:57Z
Add some tests for InMemoryRelation
commit 2ba6b1e2f79a1c41b56e51a7d3a01b06417f03dd
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:40:11Z
Add some tests for InMemoryRelation
commit 7f09753a5df4465d1e4f0d57d06b53b4637f7470
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:44:21Z
Fix InMemoryColumnarQuerySuite
commit c72c085b32179113e546fb0251032e95106b2cd3
Author: Andrew Or <[email protected]>
Date: 2016-06-24T19:00:37Z
Clean up code: abstract CachedBatch and ColumnarBatch
----