GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/12017
[SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary
encoding for some of the data
## What changes were proposed in this pull request?
Currently, when a Parquet column uses dictionary encoding for only some of
its data, a batch can contain some values that are dictionary encoded and
some that are not. The non-dictionary-encoded values cause us to remove the
dictionary from the batch, so the first values, which still depend on that
dictionary, return garbage.
This patch fixes the issue by first decoding the values that are already
dictionary encoded before switching. The reverse case, where the initial
values are not dictionary encoded, is handled in a similar way.
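
The idea can be illustrated with a minimal, self-contained sketch. This is
not the code in the patch: `ColumnBatch`, `append_dictionary_page`,
`append_plain_page` and `read` are hypothetical names for a toy model, and it
assumes that every dictionary-encoded page appended to a batch shares one
dictionary.

```python
from typing import List, Optional, Sequence


class ColumnBatch:
    """Toy column batch (hypothetical, not Spark's class).

    Holds either lazily decoded dictionary ids or plain decoded values.
    """

    def __init__(self) -> None:
        self.values: List[str] = []
        self.dictionary_ids: List[int] = []
        self.dictionary: Optional[Sequence[str]] = None

    def _decode_pending_ids(self) -> None:
        # Materialize ids while the dictionary is still attached, then drop it.
        if self.dictionary is not None:
            self.values.extend(self.dictionary[i] for i in self.dictionary_ids)
            self.dictionary_ids = []
            self.dictionary = None

    def append_dictionary_page(self, ids: Sequence[int],
                               dictionary: Sequence[str]) -> None:
        if self.values:
            # Reverse case: the batch already holds plain values, so decode
            # the incoming ids eagerly instead of attaching a dictionary.
            self.values.extend(dictionary[i] for i in ids)
        else:
            # Assumption: all dictionary pages in a batch share one dictionary.
            self.dictionary = dictionary
            self.dictionary_ids.extend(ids)

    def append_plain_page(self, values: Sequence[str]) -> None:
        # Decode existing ids BEFORE switching to plain values. Dropping the
        # dictionary without this step is what made earlier values garbage.
        self._decode_pending_ids()
        self.values.extend(values)

    def read(self) -> List[str]:
        if self.dictionary is not None:
            return [self.dictionary[i] for i in self.dictionary_ids]
        return list(self.values)


batch = ColumnBatch()
batch.append_dictionary_page([0, 1, 0], ["a", "b"])
batch.append_plain_page(["c"])
print(batch.read())
```

In this toy model, skipping `_decode_pending_ids` in `append_plain_page`
would leave the first three entries as raw ids with no dictionary to resolve
them, which mirrors the failure described above.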
## How was this patch tested?
This is difficult to test, but the issue was replicated on a test cluster
using a large TPC-DS data set.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-14217
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12017.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12017
----
commit 1ba0a4c1b69f1fe33ccc45c16836ea21a72e8bc3
Author: Nong Li <[email protected]>
Date: 2016-03-28T23:18:37Z
[SPARK-14217][SQL] Fix bug if parquet data has columns that use dictionary
encoding for some of the data.
----