GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/12279
[SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary
encoding for some of the data
## What changes were proposed in this pull request?
This PR is based on #12017
Currently, when a Parquet column uses dictionary encoding for only some of
its data, a batch can contain some values that are dictionary encoded and
some that are not. The non-dictionary-encoded values cause us to remove the
dictionary from the batch, so the values read first return garbage.
This patch fixes the issue by first decoding the values that are already
dictionary encoded before switching. A similar thing is done for the reverse
case, where the initial values are not dictionary encoded.
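The decode-before-switching idea can be illustrated with a minimal sketch.
This is not the Spark code from this patch: `ColumnBatch` and its methods are
hypothetical names invented for illustration, and the real reader works on
column vectors rather than Python lists. It only shows why the values that
are already dictionary encoded must be materialized before the dictionary is
dropped, and how the reverse case can be handled by decoding eagerly.

```python
class ColumnBatch:
    """Toy batch of one column (hypothetical; not Spark's actual classes)."""

    def __init__(self):
        self.ids = []           # dictionary ids, valid only while a dictionary is set
        self.values = []        # decoded values, valid only once no dictionary is set
        self.dictionary = None  # lookup table for lazily decoded values

    def _decode_existing(self):
        # Materialize lazily stored ids before the dictionary goes away.
        # Skipping this step is what would make the first values read as garbage.
        if self.dictionary is not None:
            self.values = [self.dictionary[i] for i in self.ids]
            self.ids = []
            self.dictionary = None

    def append_dictionary_page(self, dictionary, ids):
        same_dict = self.dictionary is None or self.dictionary is dictionary
        if not self.values and same_dict:
            # Everything so far is dictionary encoded: stay lazy.
            self.dictionary = dictionary
            self.ids.extend(ids)
        else:
            # Reverse case: the batch already holds plain values,
            # so decode the new page eagerly instead of storing ids.
            self._decode_existing()
            self.values.extend(dictionary[i] for i in ids)

    def append_plain_page(self, values):
        # Switching to plain encoding: decode what is already there first.
        self._decode_existing()
        self.values.extend(values)

    def to_list(self):
        if self.dictionary is not None:
            return [self.dictionary[i] for i in self.ids]
        return list(self.values)


dictionary = [100, 200, 300]

# Dictionary-encoded page followed by a plain page.
batch = ColumnBatch()
batch.append_dictionary_page(dictionary, [2, 0])
batch.append_plain_page([7, 8])
print(batch.to_list())

# Reverse case: plain page followed by a dictionary-encoded page.
reverse = ColumnBatch()
reverse.append_plain_page([7, 8])
reverse.append_dictionary_page(dictionary, [1, 1])
print(reverse.to_list())
```

In both orders the batch ends up holding only decoded values, so no reader
ever interprets a dictionary id as a value or a value as a dictionary id.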
## How was this patch tested?
This is difficult to test, but the issue was replicated on a test cluster
using a large TPC-DS data set.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark fix_dict
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12279.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12279
----
commit 1ba0a4c1b69f1fe33ccc45c16836ea21a72e8bc3
Author: Nong Li <[email protected]>
Date: 2016-03-28T23:18:37Z
[SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary
encoding for some of the data.
Currently, this causes batches where some values are dictionary encoded and
some which are not. The non-dictionary encoded values cause us to remove the
dictionary from the batch causing the first values to return garbage.
This patch fixes the issue by first decoding the dictionary for the values
that are already dictionary encoded before switching. A similar thing is done
for the reverse case where the initial values are not dictionary encoded.
commit e4bb4ec47956000717a7cd4c267421818864bec0
Author: Nong Li <[email protected]>
Date: 2016-03-29T18:20:27Z
CR
commit 7ee13e71c4ca930bc9b111b22d710fb19347730a
Author: Davies Liu <[email protected]>
Date: 2016-04-08T18:10:13Z
Merge commit 'refs/pull/12017/head' of github.com:apache/spark into fix_dict
commit b039742fdc3738e936af6763f06552c5f430f616
Author: Davies Liu <[email protected]>
Date: 2016-04-09T22:40:53Z
fix bugs
----