GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/12279

    [SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data

    ## What changes were proposed in this pull request?
    
    This PR is based on #12017
    
    Currently, when a Parquet column uses dictionary encoding for only some
    of its data, a batch can contain some values that are dictionary encoded
    and some that are not. The non-dictionary-encoded values cause us to
    remove the dictionary from the batch, so the earlier dictionary-encoded
    values in that batch return garbage.
    
    This patch fixes the issue by first decoding the dictionary for the
    values that are already dictionary encoded before switching. A similar
    thing is done for the reverse case, where the initial values are not
    dictionary encoded.
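
To make the description above concrete, here is a minimal sketch of the "decode before switching" idea. It is not the code from this patch and does not use Spark's reader classes: `ToyColumnVector`, `append_dictionary_page`, `append_plain_page`, and `get` are hypothetical names invented for illustration, and the batch is modelled as plain Python lists of integers.

```python
from typing import List, Optional, Sequence


class ToyColumnVector:
    """Hypothetical stand-in for one column of a batch (not a Spark class)."""

    def __init__(self) -> None:
        self.values: List[int] = []  # fully decoded values
        self.ids: List[int] = []     # dictionary ids, decoded lazily on read
        self.dictionary: Optional[Sequence[int]] = None

    def _materialize(self) -> None:
        # Decode every pending id BEFORE the dictionary is dropped, so the
        # values appended earlier in the batch stay readable.
        if self.dictionary is not None:
            self.values = [self.dictionary[i] for i in self.ids]
            self.ids = []
            self.dictionary = None

    def append_dictionary_page(self, dictionary: Sequence[int],
                               ids: Sequence[int]) -> None:
        same_dict = self.dictionary is None or self.dictionary is dictionary
        if not self.values and same_dict:
            # The batch holds only ids for this dictionary: stay lazy.
            self.dictionary = dictionary
            self.ids.extend(ids)
        else:
            # Reverse case: the batch already holds decoded values, so
            # decode the new ids eagerly instead of attaching a dictionary.
            self._materialize()
            self.values.extend(dictionary[i] for i in ids)

    def append_plain_page(self, values: Sequence[int]) -> None:
        # Switching to plain encoding mid-batch: decode existing ids first.
        self._materialize()
        self.values.extend(values)

    def get(self, row: int) -> int:
        if self.dictionary is not None:
            return self.dictionary[self.ids[row]]
        return self.values[row]


dictionary = [100, 200, 300]

mixed = ToyColumnVector()
mixed.append_dictionary_page(dictionary, [2, 0])  # dictionary-encoded page
mixed.append_plain_page([7, 8])                   # plain page, same batch
mixed_rows = [mixed.get(i) for i in range(4)]

reverse = ToyColumnVector()
reverse.append_plain_page([1])                    # plain page first
reverse.append_dictionary_page(dictionary, [1])   # then a dictionary page
reverse_rows = [reverse.get(i) for i in range(2)]
```

In this sketch the failure described above would correspond to setting `dictionary = None` in `append_plain_page` without calling `_materialize()` first: the first two rows would then no longer have the dictionary needed to decode their ids.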
    
    ## How was this patch tested?
    
    This is difficult to test, but the issue was replicated on a test cluster
    using a large TPC-DS data set.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_dict

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12279
    
----
commit 1ba0a4c1b69f1fe33ccc45c16836ea21a72e8bc3
Author: Nong Li <[email protected]>
Date:   2016-03-28T23:18:37Z

    [SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data.
    
    Currently, when a Parquet column uses dictionary encoding for only some
    of its data, a batch can contain some values that are dictionary encoded
    and some that are not. The non-dictionary-encoded values cause us to
    remove the dictionary from the batch, so the earlier dictionary-encoded
    values in that batch return garbage.
    
    This patch fixes the issue by first decoding the dictionary for the
    values that are already dictionary encoded before switching. A similar
    thing is done for the reverse case, where the initial values are not
    dictionary encoded.

commit e4bb4ec47956000717a7cd4c267421818864bec0
Author: Nong Li <[email protected]>
Date:   2016-03-29T18:20:27Z

    CR

commit 7ee13e71c4ca930bc9b111b22d710fb19347730a
Author: Davies Liu <[email protected]>
Date:   2016-04-08T18:10:13Z

    Merge commit 'refs/pull/12017/head' of github.com:apache/spark into fix_dict

commit b039742fdc3738e936af6763f06552c5f430f616
Author: Davies Liu <[email protected]>
Date:   2016-04-09T22:40:53Z

    fix bugs

----


