GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/15219
[WIP][SPARK-14098][SQL] Generate Java code to build CachedColumnarBatch and
get values when DataFrame.cache() is called
## What changes were proposed in this pull request?
This PR is derived from https://github.com/apache/spark/pull/11956 and
https://github.com/apache/spark/pull/13899.
I will fill in the details this weekend.
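For context, here is a minimal, hand-written sketch of the user-facing path this change targets: caching a DataFrame so its data is materialized as in-memory columnar batches and then scanned from the cache. The table shape and column names are illustrative only and are not taken from the patch.

```scala
// Illustrative only: the user-facing cache() path this PR generates code for.
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
    import spark.implicits._

    val df = (1L to 1000L).toDF("i").selectExpr("i", "i * 2 AS j")
    df.cache()    // marks the plan for in-memory columnar caching
    df.count()    // first action materializes the cached column batches
    df.groupBy("i").sum("j").collect()  // later scans read from the cached batches

    spark.stop()
  }
}
```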
## How was this patch tested?
Added new tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark columncache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15219.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15219
----
commit aef8fb3c441942c842d36d27585f1012b355ea72
Author: Andrew Or <[email protected]>
Date: 2016-06-17T21:47:20Z
Add benchmark code
commit bcb98726a746cc5f7091d9d794182e26ad957180
Author: Andrew Or <[email protected]>
Date: 2016-06-20T18:35:21Z
backup
commit c075e2cb95250bd993b1ecbe5fec9ae6332dfcb0
Author: Andrew Or <[email protected]>
Date: 2016-06-20T22:36:44Z
Narrow benchmarked code + add back old scan code
commit 837bb954810708f5aaabdb2f4383138d5486256e
Author: Andrew Or <[email protected]>
Date: 2016-06-21T18:28:19Z
Fix benchmark to time only the read path
commit 828d90e21c7fadc2eaac6aae24a0a39d646a8585
Author: Andrew Or <[email protected]>
Date: 2016-06-21T21:35:01Z
First working impl. of ColumnarBatch based caching
Note that this doesn't work yet: spark.table("tab1").collect() fails because
we try to cast ColumnarBatch.Row into UnsafeRow. This works, however:
spark.table("tab1").groupBy("i").sum("j").collect().
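(For reference, a hedged restatement of the two calls contrasted above, assuming a spark-shell session with a cached table "tab1" holding numeric columns i and j; the names come from the commit note, everything else is illustrative.)

```scala
// Spark shell sketch (spark: SparkSession already in scope). Assumes a cached
// table "tab1" with numeric columns i and j, as in the commit note above.
spark.table("tab1").collect()                        // failed here: ColumnarBatch.Row cannot be cast to UnsafeRow
spark.table("tab1").groupBy("i").sum("j").collect()  // worked: the aggregate path consumed the columnar rows
```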
commit 431e81acb4644f35ee330d982a0b335d1d5eeeef
Author: Andrew Or <[email protected]>
Date: 2016-06-22T00:45:34Z
Always enable codegen and vectorized hashmap
commit e3f2a236d35153f6bf35bd73b5525ee81862d4ec
Author: Andrew Or <[email protected]>
Date: 2016-06-22T19:48:37Z
Don't benchmark aggregate
commit 43aa81839f2721ab6de0f7b65ce5b9a64794b92d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T21:57:56Z
Codegen memory scan using ColumnarBatches
commit af2897bcc688bb032adf306e2571ccfe5e73f193
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:10:41Z
Clean up the code a little
commit b3737b817c8a359b40a308512a3f73ae97080b1d
Author: Andrew Or <[email protected]>
Date: 2016-06-22T23:19:42Z
Clean up a little more
commit 75499ca7f301c75be38455a946dee6f3384faf91
Author: Andrew Or <[email protected]>
Date: 2016-06-23T22:10:41Z
Generate code for write path to support other types
Previously we could only support schemas where all columns are
Longs because we hardcoded putLong and getLong calls in the write
path. This led to unfathomable NPEs if we tried to cache something
with other types.
This commit fixes this by generalizing the code that builds column
batches.
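(Not the generated code from this commit; a hand-written sketch of the idea, using the put* methods on the Spark 2.x vectorized ColumnVector API of that era. The helper name and structure are illustrative.)

```scala
// Illustrative sketch only: dispatch on the column's DataType instead of
// hardcoding putLong/getLong for every column. Package paths match the
// Spark 2.x execution.vectorized API of that time.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.vectorized.ColumnVector
import org.apache.spark.sql.types._

object ColumnWriteSketch {
  def writeValue(col: ColumnVector, dt: DataType, row: InternalRow, ordinal: Int, rowId: Int): Unit = {
    if (row.isNullAt(ordinal)) {
      col.putNull(rowId)
    } else dt match {
      case IntegerType => col.putInt(rowId, row.getInt(ordinal))
      case LongType    => col.putLong(rowId, row.getLong(ordinal))
      case DoubleType  => col.putDouble(rowId, row.getDouble(ordinal))
      // ...other types handled analogously in the real generated code
      case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
    }
  }
}
```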
commit 1ac9ebb8892f05cda8dfa33e3c8b108d53fc2b77
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:01:44Z
Move cache benchmark to new file
commit 1613c7fef268702cb31433af5d241cfe4a388565
Author: Andrew Or <[email protected]>
Date: 2016-06-23T23:43:13Z
Abstract codegen code into ColumnarBatchScan
commit f0197ecf688a5471ce89f8ef5a928c838feff7be
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:04:09Z
Introduce CACHE_CODEGEN config to reduce dup code
commit b2abd051d81d94de241dca252578bb5fe37d0afc
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:34:57Z
Add some tests for InMemoryRelation
commit d6550f2441fb6922432828b3aad74d937ded4226
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:40:11Z
Add some tests for InMemoryRelation
commit 5430dadd9688687e7c8ca404c3156f0b8f1c5679
Author: Andrew Or <[email protected]>
Date: 2016-06-24T00:44:21Z
Fix InMemoryColumnarQuerySuite
commit ecf625c9743eb2258bb7700cd6b051e893330c87
Author: Andrew Or <[email protected]>
Date: 2016-06-24T19:00:37Z
Clean up code: abstract CachedBatch and ColumnarBatch
commit ae5e975496d2c7ffa19df0f4ad8fc90a751a271a
Author: Andrew Or <[email protected]>
Date: 2016-06-24T21:53:10Z
Add end-to-end benchmark, including write path
commit 1a621bf67162ba202290394115445d912fdd2715
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-09-01T02:47:25Z
merge with master
commit a3b1c1796527bd8a2d2a0fca73c197c978765118
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-09-23T18:26:11Z
support all data types
----