GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/15219
[WIP][SPARK-14098][SQL] Generate Java code to build CachedColumnarBatch and
get values when DataFrame.cache() is called
## What changes were proposed in this pull request?
This PR is derived from https://github.com/apache/spark/pull/11956 and
https://github.com/apache/spark/pull/13899.
I will fill in the details this weekend.
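For context, here is a minimal, hand-written sketch of the user-facing path this change targets: caching a DataFrame so its data is materialized as in-memory columnar batches and then scanned from the cache. The table shape and column names are illustrative only and are not taken from the patch.

```scala
// Illustrative only: the user-facing cache() path this PR generates code for.
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
    import spark.implicits._

    val df = (1L to 1000L).toDF("i").selectExpr("i", "i * 2 AS j")
    df.cache()    // marks the plan for in-memory columnar caching
    df.count()    // first action materializes the cached column batches
    df.groupBy("i").sum("j").collect()  // later scans read from the cached batches

    spark.stop()
  }
}
```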
## How was this patch tested?
Added new tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark columncache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15219.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15219
----
commit aef8fb3c441942c842d36d27585f1012b355ea72
Author: Andrew Or <[email protected]>
Date: 2016-06-17T21:47:20Z
Add benchmark code
commit bcb98726a746cc5f7091d9d794182e26ad957180
Author: Andrew Or <[email protected]>
Date: 2016-06-20T18:35:21Z
backup
commit c075e2cb95250bd993b1ecbe5fec9ae6332dfcb0
Author: Andrew Or <[email protected]>
Date: 2016-06-20T22:36:44Z
Narrow benchmarked code + add back old scan code
commit 837bb954810708f5aaabdb2f4383138d5486256e
Author: Andrew Or <[email protected]>
Date: 2016-06-21T18:28:19Z
Fix benchmark to time only the read path
commit 828d90e21c7fadc2eaac6aae24a0a39d646a8585
Author: Andrew Or <[email protected]>
Date: 2016-06-21T21:35:01Z
First working impl. of ColumnarBatch based caching
Note that this doesn't work yet: spark.table("tab1").collect() fails because
we try to cast ColumnarBatch.Row into UnsafeRow. This works, however:
spark.table("tab1").groupBy("i").sum("j").collect().
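(For reference, a hedged restatement of the two calls contrasted above, assuming a spark-shell session with a cached table "tab1" holding numeric columns i and j; the names come from the commit note, everything else is illustrative.)

```scala
// Spark shell sketch (spark: SparkSession already in scope). Assumes a cached
// table "tab1" with numeric columns i and j, as in the commit note above.
spark.table("tab1").collect()                        // failed here: ColumnarBatch.Row cannot be cast to UnsafeRow
spark.table("tab1").groupBy("i").sum("j").collect()  // worked: the aggregate path consumed the columnar rows
```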
commit 431e81acb4644f35ee330d982a0b335d1d5eeeef
Author: Andrew Or <[email protected]>
Date: 2016-06-22T00:45:34Z
Always enable codegen and vectorized hashmap
commit e3f2a236d35153f6bf35bd73b5525ee81862d4ec
Author: Andrew Or <[email protected]>
Date: 2016-06-22T19:48:37Z
Don't benchmark aggregate
commit 43aa81839f2721ab6de0f7b65ce5b9a64794b92d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T21:57:56Z
Codegen memory scan using ColumnarBatches
commit af2897bcc688bb032adf306e2571ccfe5e73f193
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:10:41Z
Clean up the code a little
commit b3737b817c8a359b40a308512a3f73ae97080b1d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:19:42Z
Clean up a little more
commit 75499ca7f301c75be38455a946dee6f3384faf91
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:10:41Z
Generate code for write path to support other types
Previously we could only support schemas where all columns are
Longs because we hardcoded putLong and getLong calls in the write
path. This led to unfathomable NPEs if we tried to cache something
with other types.
This commit fixes this by generalizing the code that builds column
batches.
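(Not the generated code from this commit; a hand-written sketch of the idea, using the put* methods on the Spark 2.x vectorized ColumnVector API of that era. The helper name and structure are illustrative.)

```scala
// Illustrative sketch only: dispatch on the column's DataType instead of
// hardcoding putLong/getLong for every column. Package paths match the
// Spark 2.x execution.vectorized API of that time.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.vectorized.ColumnVector
import org.apache.spark.sql.types._

object ColumnWriteSketch {
  def writeValue(col: ColumnVector, dt: DataType, row: InternalRow, ordinal: Int, rowId: Int): Unit = {
    if (row.isNullAt(ordinal)) {
      col.putNull(rowId)
    } else dt match {
      case IntegerType => col.putInt(rowId, row.getInt(ordinal))
      case LongType    => col.putLong(rowId, row.getLong(ordinal))
      case DoubleType  => col.putDouble(rowId, row.getDouble(ordinal))
      // ...other types handled analogously in the real generated code
      case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
    }
  }
}
```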
commit 1ac9ebb8892f05cda8dfa33e3c8b108d53fc2b77
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:01:44Z
Move cache benchmark to new file
commit 1613c7fef268702cb31433af5d241cfe4a388565
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:43:13Z
Abstract codegen code into ColumnarBatchScan
commit f0197ecf688a5471ce89f8ef5a928c838feff7be
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:04:09Z
Introduce CACHE_CODEGEN config to reduce dup code
commit b2abd051d81d94de241dca252578bb5fe37d0afc
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:34:57Z
Add some tests for InMemoryRelation
commit d6550f2441fb6922432828b3aad74d937ded4226
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:40:11Z
Add some tests for InMemoryRelation
commit 5430dadd9688687e7c8ca404c3156f0b8f1c5679
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:44:21Z
Fix InMemoryColumnarQuerySuite
commit ecf625c9743eb2258bb7700cd6b051e893330c87
Author: Andrew Or <[email protected]>
Date: 2016-06-24T19:00:37Z
Clean up code: abstract CachedBatch and ColumnarBatch
commit ae5e975496d2c7ffa19df0f4ad8fc90a751a271a
Author: Andrew Or <[email protected]>
Date: 2016-06-24T21:53:10Z
Add end-to-end benchmark, including write path
commit 1a621bf67162ba202290394115445d912fdd2715
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-09-01T02:47:25Z
merge with master
commit a3b1c1796527bd8a2d2a0fca73c197c978765118
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-09-23T18:26:11Z
support all data types
----