GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/11434

    [SPARK-13574][SQL] Improve parquet decoding of dictionary encoded strings.

    ## What changes were proposed in this pull request?
    
    Before this patch, decoding dictionary-encoded strings would explode the
    dictionary: every decoded value created a new copy of its data in the
    columnar batch. With this patch, we decode the dictionary values into the
    columnar batch once and then, for each data value, populate only the length
    and offset.
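
The difference between the two strategies can be sketched as follows. This is a minimal illustration in Python, not the code in this patch: `StringColumn`, `decode_copy_per_row`, and `decode_shared_dictionary` are hypothetical names, and the column is modeled as a single byte buffer plus per-row offset and length arrays.

```python
from dataclasses import dataclass, field


@dataclass
class StringColumn:
    """Toy string column (hypothetical): one byte buffer plus per-row offset/length."""
    data: bytearray = field(default_factory=bytearray)
    offsets: list = field(default_factory=list)
    lengths: list = field(default_factory=list)

    def append_bytes(self, value: bytes) -> tuple:
        """Copy value into the buffer and return its (offset, length)."""
        offset = len(self.data)
        self.data.extend(value)
        return offset, len(value)

    def get(self, row: int) -> str:
        """Read the string for a row through its offset and length."""
        start = self.offsets[row]
        return bytes(self.data[start:start + self.lengths[row]]).decode("utf-8")


def decode_copy_per_row(dictionary, ids):
    """Behaviour described as 'before': every row copies its dictionary value."""
    col = StringColumn()
    for i in ids:
        offset, length = col.append_bytes(dictionary[i])
        col.offsets.append(offset)
        col.lengths.append(length)
    return col


def decode_shared_dictionary(dictionary, ids):
    """Behaviour described as 'after': copy each dictionary value once,
    then fill in only offset and length for every row."""
    col = StringColumn()
    slots = [col.append_bytes(value) for value in dictionary]
    for i in ids:
        offset, length = slots[i]
        col.offsets.append(offset)
        col.lengths.append(length)
    return col


# Made-up sample data: a three-entry dictionary and six encoded rows.
dictionary = [b"spark", b"parquet", b"sql"]
ids = [0, 1, 1, 2, 0, 1]
before = decode_copy_per_row(dictionary, ids)
after = decode_shared_dictionary(dictionary, ids)
```

Both columns return the same strings per row, but in this toy model the shared-dictionary buffer grows with the size of the dictionary rather than with the number of rows.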
    
    ## How was this patch tested?
    
    Results:
    ```
    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    String Dictionary:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------
    SQL Parquet Vectorized                    481 /  503         21.8          45.9
    SQL Parquet Vectorized  (Before)          692 /  746         15.2          66.0
    SQL Parquet MR                           1097 / 1273          9.6         104.6
    ```
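
For reading the table, the relative improvement can be worked out from the best-case times. The short Python sketch below only does arithmetic on the numbers reported above; the helper names are hypothetical, and the relation between the last two columns (rate in M/s equals 1000 divided by nanoseconds per row) is an assumption that happens to match the reported figures.

```python
# Best-case times (ms) and per-row costs (ns), copied from the results table.
best_ms = {"vectorized": 481, "vectorized_before": 692, "parquet_mr": 1097}
per_row_ns = {"vectorized": 45.9, "vectorized_before": 66.0, "parquet_mr": 104.6}


def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the fast run is than the slow run."""
    return slow_ms / fast_ms


def rate_m_per_s(ns_per_row: float) -> float:
    """Rows per second in millions, assuming rate = 1000 / (ns per row)."""
    return 1000.0 / ns_per_row


vs_before = speedup(best_ms["vectorized_before"], best_ms["vectorized"])
vs_mr = speedup(best_ms["parquet_mr"], best_ms["vectorized"])
rates = {name: round(rate_m_per_s(ns), 1) for name, ns in per_row_ns.items()}
```

On the best-case times this gives roughly a 1.44x improvement over the vectorized reader before the patch and roughly 2.28x over the MR reader.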

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-13574

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11434.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11434
    
----
commit e6394139ec695dfd0a3467f977a220848fce8588
Author: Nong Li <[email protected]>
Date:   2016-02-28T01:32:41Z

    [SPARK-13574][SQL] Improve parquet decoding of dictionary encoded strings.
    
    Before this patch, decoding dictionary-encoded strings would explode the
    dictionary: every decoded value created a new copy of its data in the
    columnar batch. With this patch, we decode the dictionary values into the
    columnar batch once and then, for each data value, populate only the length
    and offset.
    
    Results:
    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    String Dictionary:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------
    SQL Parquet Vectorized                    481 /  503         21.8          45.9
    SQL Parquet Vectorized  (Before)          692 /  746         15.2          66.0
    SQL Parquet MR                           1097 / 1273          9.6         104.6

----

