GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/12017

    [SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data

    ## What changes were proposed in this pull request?
    
    Currently, when a Parquet column uses dictionary encoding for only some of
    its data, a batch can contain some values that are dictionary encoded and
    some that are not. The non-dictionary-encoded values cause us to remove the
    dictionary from the batch, which makes the first values return garbage.
    
    This patch fixes the issue by first decoding the values that are already
    dictionary encoded before switching away from the dictionary. The reverse
    case, where the initial values are not dictionary encoded, is handled in a
    similar way.
    
    ## How was this patch tested?
    This is difficult to test, but the issue was replicated on a test cluster
    using a large TPC-DS data set.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-14217

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12017.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12017
    
----
commit 1ba0a4c1b69f1fe33ccc45c16836ea21a72e8bc3
Author: Nong Li <[email protected]>
Date:   2016-03-28T23:18:37Z

    [SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data.

----


