GitHub user kiszk opened a pull request:

    https://github.com/apache/spark/pull/14091

    [SPARK-16412][SQL] Generate Java code that gets an array in each column of 
CachedBatch when DataFrame.cache() is called

    ## What changes were proposed in this pull request?
    
    Waiting for #11956 to be merged.
    
    This PR generates Java code that directly gets an array from each column of 
CachedBatch when DataFrame.cache() is called. This is done as part of whole-stage 
code generation.
    
    When DataFrame.cache() is called, data is stored in column-oriented storage 
(columnar cache) in CachedBatch. This PR avoids the conversion from column-oriented 
storage to row-oriented storage. This PR handles an array type that is stored 
in a column. 
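
    As a rough illustration of the conversion being avoided, the sketch below 
compares a row-materializing read with a direct columnar read of an array-typed 
column. The class `ColumnarArraySketch` and its methods are hypothetical stand-ins 
written for this example; they are not Spark's CachedBatch API or the code this 
PR generates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a cached batch (NOT Spark's CachedBatch):
// every column holds one int[] value per row.
final class ColumnarArraySketch {
    // Column-oriented layout: columns[columnOrdinal][rowIndex] is an int[] value.
    private final int[][][] columns;

    ColumnarArraySketch(int[][][] columns) {
        this.columns = columns;
    }

    int numRows() {
        return columns.length == 0 ? 0 : columns[0].length;
    }

    // Row-oriented path: build every row as Object[] first, then read the column.
    long sumViaRows(int ordinal) {
        List<Object[]> rows = new ArrayList<>();
        for (int r = 0; r < numRows(); r++) {
            Object[] row = new Object[columns.length];
            for (int c = 0; c < columns.length; c++) {
                row[c] = columns[c][r];
            }
            rows.add(row);
        }
        long sum = 0;
        for (Object[] row : rows) {
            for (int v : (int[]) row[ordinal]) {
                sum += v;
            }
        }
        return sum;
    }

    // Columnar path: read each array straight from the column, no row objects.
    long sumDirect(int ordinal) {
        long sum = 0;
        for (int[] array : columns[ordinal]) {
            for (int v : array) {
                sum += v;
            }
        }
        return sum;
    }
}
```

    Both methods return the same sum; the direct path simply skips allocating 
and filling the intermediate row objects.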
    
    This PR generates code for both row-oriented storage and column-oriented 
storage only if
     - InMemoryColumnarTableScan exists in a plan sub-tree. The decision is 
made at runtime by checking whether the given iterator is a ColumnarIterator
     - Sort or join does not exist in a plan sub-tree. 
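
    The runtime check in the first condition could look roughly like the sketch 
below, which picks the columnar path only when the input iterator has the 
columnar type and otherwise falls back to row-by-row access. The interface 
`ColumnarIteratorLike` and the helper `columnar` are hypothetical names invented 
for this example, and the plan-level condition on sort and join is not modeled.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class IteratorDispatchSketch {
    // Hypothetical marker type: an iterator that can also expose a whole column.
    interface ColumnarIteratorLike extends Iterator<Object[]> {
        int[] intColumn(int ordinal);
    }

    // Sums one int column, choosing the access path from the runtime type.
    static long sumColumn(Iterator<Object[]> input, int ordinal) {
        long sum = 0;
        if (input instanceof ColumnarIteratorLike) {
            // Columnar path: read the column directly.
            for (int v : ((ColumnarIteratorLike) input).intColumn(ordinal)) {
                sum += v;
            }
        } else {
            // Row path: pull one row at a time and pick the field.
            while (input.hasNext()) {
                sum += (Integer) input.next()[ordinal];
            }
        }
        return sum;
    }

    // Toy columnar iterator over a single batch, for demonstration only.
    static ColumnarIteratorLike columnar(final int[][] columns) {
        return new ColumnarIteratorLike() {
            private int row = 0;

            @Override
            public boolean hasNext() {
                return columns.length > 0 && row < columns[0].length;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Object[] out = new Object[columns.length];
                for (int c = 0; c < columns.length; c++) {
                    out[c] = columns[c][row];
                }
                row++;
                return out;
            }

            @Override
            public int[] intColumn(int ordinal) {
                return columns[ordinal];
            }
        };
    }
}
```

    Generating both paths keeps the same generated code usable when the input 
turns out to be an ordinary row iterator.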
    
    This PR generates Java code for the columnar cache only if the types of all 
columns accessed by the operations are primitive or array types.
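
    A minimal sketch of such an eligibility check follows. The `ColType` enum 
and the set of types treated as supported are assumptions made for this example; 
they do not reproduce Spark's DataType hierarchy or the exact set of types this 
PR accepts.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ColumnTypeCheckSketch {
    // Hypothetical, simplified type tags (NOT Spark's DataType classes).
    enum ColType { BOOLEAN, INT, LONG, FLOAT, DOUBLE, ARRAY, STRING, STRUCT }

    // Assumption for this sketch: primitives and arrays are supported.
    private static final Set<ColType> SUPPORTED = EnumSet.of(
        ColType.BOOLEAN, ColType.INT, ColType.LONG,
        ColType.FLOAT, ColType.DOUBLE, ColType.ARRAY);

    // Only the columns that the operations actually access are checked,
    // so an unsupported column that is never read does not block the columnar path.
    static boolean canUseColumnarCode(List<ColType> schema, int[] accessedOrdinals) {
        for (int ordinal : accessedOrdinals) {
            if (!SUPPORTED.contains(schema.get(ordinal))) {
                return false;
            }
        }
        return true;
    }
}
```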
    
    I will add benchmark suites 
[here](https://github.com/kiszk/spark/blob/SPARK-14098/sql/core/src/test/scala/org/apache/spark/sql/DataFrameCacheBenchmark.scala)
    
    
    
    ## How was this patch tested?
    
    Added new tests to `DataFrameCacheSuite.scala`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-16412

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14091.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14091
    
----
commit 09af5a5851786b918f45c6f997b1c357745fe883
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-07-07T10:36:14Z

    support codegen for an array in CachedBatch

commit 8e218e38d5acb6c04db221fcd3cd6d2483926552
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-07-07T10:36:34Z

    update test suites

commit 54df41c8691f02dd9eac3eef3d816a130b87a5c9
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-07-07T13:18:58Z

    remove debug print

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
