GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/11435

    [SPARK-13255][SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.

    ## What changes were proposed in this pull request?
    
    
    Currently, the Parquet reader returns rows one by one, which is bad for performance.
    This patch updates the reader to return ColumnarBatches directly. This is only
    enabled with whole-stage codegen, which is currently the only operator able to
    consume ColumnarBatches (instead of rows). The current implementation is a bit of
    a hack to get this working, and we should do more refactoring of these low-level
    interfaces to make it work better.
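
    The row-at-a-time vs. batch-at-a-time difference can be sketched with a toy
    example. Note this is a hypothetical illustration, not Spark's actual
    ColumnarBatch API; all class and method names below are made up:

    ```java
    // Toy sketch: why handing an operator a columnar batch beats handing it
    // one row object at a time. Hypothetical classes, not Spark's API.
    public class BatchSketch {
        // A "columnar batch": one primitive array per column, plus a row count.
        static final class Batch {
            final int numRows;
            final long[] col0;
            Batch(long[] col0) { this.numRows = col0.length; this.col0 = col0; }
        }

        // Row-at-a-time: each value is wrapped in a row object before the
        // operator sees it.
        static final class Row {
            final long v;
            Row(long v) { this.v = v; }
        }

        static long sumRowAtATime(Batch b) {
            long sum = 0;
            for (int i = 0; i < b.numRows; i++) {
                Row r = new Row(b.col0[i]); // per-row allocation: the overhead
                sum += r.v;                 // this patch avoids
            }
            return sum;
        }

        // Batch-at-a-time: the consumer loops over the column arrays directly,
        // which is roughly what whole-stage codegen can do when it is handed a
        // whole batch instead of individual rows.
        static long sumBatchAtATime(Batch b) {
            long sum = 0;
            for (int i = 0; i < b.numRows; i++) {
                sum += b.col0[i]; // tight loop over a primitive array, no row objects
            }
            return sum;
        }

        public static void main(String[] args) {
            Batch b = new Batch(new long[] {1, 2, 3, 4, 5});
            System.out.println(sumRowAtATime(b));   // 15
            System.out.println(sumBatchAtATime(b)); // 15
        }
    }
    ```

    Both loops compute the same result; the batch version simply skips the
    per-row object materialization, which is where the per-row overhead in the
    benchmark below comes from.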
    
    ## How was this patch tested?
    
    ```
    Results:
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (before)                             8897 / 9265         12.9          77.2
    q55                                      5486 / 5753         21.0          47.6
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-13255

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11435
    
----
commit c1b6ab1d00542b4990e119ca896eecbd677eeadd
Author: Nong Li <[email protected]>
Date:   2016-02-24T23:06:46Z

    [SPARK-13255][SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.
    
    Currently, the Parquet reader returns rows one by one, which is bad for performance.
    This patch updates the reader to return ColumnarBatches directly. This is only
    enabled with whole-stage codegen, which is currently the only operator able to
    consume ColumnarBatches (instead of rows). The current implementation is a bit of
    a hack to get this working, and we should do more refactoring of these low-level
    interfaces to make it work better.
    
    Results:
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (before)                             8897 / 9265         12.9          77.2
    q55                                      5486 / 5753         21.0          47.6

----

