GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/11375

    [SPARK-13499][SQL] Performance improvements for parquet reader.

    ## What changes were proposed in this pull request?
    
    This patch includes these performance fixes:
      - Remove unnecessary setNotNull() calls. The NULL bits are cleared already.
      - Speed up RLE group decoding.
      - Speed up dictionary decoding by decoding NULLs directly into the result.
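
A minimal sketch of the idea behind the last bullet, not the code in this patch. It assumes definition levels arrive as RLE runs: a run of non-null rows is filled by dictionary lookup, and a run of NULLs is written into the null bitmap with one bulk call. The class name, the `Run` and `IntColumn` types, and `decode` are hypothetical names chosen for illustration; they are not Spark or Parquet APIs. Because the bitmap starts out cleared, non-null rows need nothing like setNotNull().

```java
import java.util.Arrays;

final class NullAwareDictionaryDecode {
    /** One RLE run of definition levels: `level` repeated `count` times. */
    static final class Run {
        final int level;
        final int count;
        Run(int level, int count) { this.level = level; this.count = count; }
    }

    /** Hypothetical stand-in for a column vector: values plus a null bitmap. */
    static final class IntColumn {
        final int[] values;
        final boolean[] isNull; // Java zero-initializes this, so every row starts as "not null"
        IntColumn(int capacity) {
            values = new int[capacity];
            isNull = new boolean[capacity];
        }
    }

    /**
     * Decodes one batch. Rows whose definition level equals maxDefLevel are
     * non-null and consume one dictionary id each; all other rows are NULL.
     * Returns the number of rows written.
     */
    static int decode(Run[] defLevelRuns, int maxDefLevel, int[] dictionaryIds,
                      int[] dictionary, IntColumn out) {
        int row = 0;
        int idPos = 0;
        for (Run run : defLevelRuns) {
            if (run.level == maxDefLevel) {
                // Non-null run: look up values; the null bits are already clear.
                for (int i = 0; i < run.count; i++) {
                    out.values[row + i] = dictionary[dictionaryIds[idPos++]];
                }
            } else {
                // NULL run: mark the whole range at once, consuming no ids.
                Arrays.fill(out.isNull, row, row + run.count, true);
            }
            row += run.count;
        }
        return row;
    }

    public static void main(String[] args) {
        int[] dictionary = {100, 200, 300};
        int[] ids = {2, 0, 1};
        Run[] runs = { new Run(1, 2), new Run(0, 2), new Run(1, 1) };
        IntColumn out = new IntColumn(5);
        int rows = decode(runs, 1, ids, dictionary, out);
        System.out.println(rows + " rows, values=" + Arrays.toString(out.values)
            + ", isNull=" + Arrays.toString(out.isNull));
    }
}
```

Processing a whole run at a time keeps the per-row branch on the definition level out of the inner loop, which is the kind of saving the RLE and dictionary bullets describe.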
    
    ## How was this patch tested?
    
    In addition to the updated benchmarks, the effect of these changes on
    TPCDS Q55 (sf40) is:
    
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (Before)                             6398 / 6616         18.0          55.5
    q55 (After)                              4983 / 5189         23.1          43.3
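
For reference, the columns in the table above relate to each other by simple arithmetic. The sketch below derives the speedup from the reported times and the per-row cost from the reported rate. The class and method names are hypothetical, and because the inputs are the rounded figures from the table, the derived values match the reported ones only approximately.

```java
final class Q55Speedup {
    /** Ratio of the old time to the new time; values above 1.0 mean faster. */
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    /** Converts a rate in millions of rows per second to nanoseconds per row. */
    static double nsPerRow(double rateMillionsPerSec) {
        // 1e9 ns per second divided by (rate * 1e6) rows per second.
        return 1000.0 / rateMillionsPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("best-time speedup: %.3f%n", speedup(6398, 4983));
        System.out.printf("avg-time speedup:  %.3f%n", speedup(6616, 5189));
        System.out.printf("ns/row before:     %.2f%n", nsPerRow(18.0));
        System.out.printf("ns/row after:      %.2f%n", nsPerRow(23.1));
    }
}
```

On the reported numbers this works out to roughly a 1.28x improvement in both best and average time.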

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-13499

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11375.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11375
    
----
commit 858080b2626394b7dd975498690dcd0cfd27bf78
Author: Nong Li <[email protected]>
Date:   2016-02-25T07:43:31Z

    [SPARK-13499][SQL] Performance improvements for parquet reader.
    
    This patch includes these performance fixes:
      - Remove unnecessary setNotNull() calls. The NULL bits are cleared already.
      - Speed up RLE group decoding.
      - Speed up dictionary decoding by decoding NULLs directly into the result.
    
    In addition to the updated benchmarks, the effect of these changes on
    TPCDS Q55 (sf40) is:
    
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (Before)                             6398 / 6616         18.0          55.5
    q55 (After)                              4983 / 5189         23.1          43.3

----

