GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/14944

    [SPARK-16334][BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error

    ## What changes were proposed in this pull request?
    
    Backports https://github.com/apache/spark/pull/14941 to branch-2.0. This
    patch fixes a bug in the vectorized Parquet reader caused by reusing the
    same dictionary column vector while reading consecutive row groups.
    Specifically, the issue manifests for certain distributions of
    dictionary-encoded and plain-encoded data when the underlying bit-packed
    dictionary data is read into a column-vector-based data structure.
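
For readers unfamiliar with this class of bug, the toy Python sketch below shows how reusing a buffer across row groups can fail. It is not the Spark code or the actual patch: every class and function name is hypothetical, and the failure mode shown (a reused id buffer that is too small for the next row group) is an assumed stand-in for the real condition in the reader.

```python
class ToyDictionaryIds:
    """Hypothetical reusable buffer of dictionary ids (not a Spark class)."""

    def __init__(self, capacity):
        self.data = [0] * capacity

    def reserve(self, capacity):
        # Grow the backing list when the request exceeds the current size.
        if capacity > len(self.data):
            self.data.extend([0] * (capacity - len(self.data)))

    def put(self, index, dict_id):
        # Raises IndexError when index is outside the backing list.
        self.data[index] = dict_id


class ToyColumnVector:
    """Hypothetical column vector that owns one dictionary-id buffer."""

    def __init__(self):
        self.dictionary_ids = None

    def reserve_ids_buggy(self, capacity):
        # Allocates once, then reuses the buffer without checking its size.
        if self.dictionary_ids is None:
            self.dictionary_ids = ToyDictionaryIds(capacity)
        return self.dictionary_ids

    def reserve_ids_safe(self, capacity):
        # Reuses the buffer, but ensures it is large enough for this request.
        if self.dictionary_ids is None:
            self.dictionary_ids = ToyDictionaryIds(capacity)
        else:
            self.dictionary_ids.reserve(capacity)
        return self.dictionary_ids


def decode(reserve, dictionary, ids):
    """Store the ids in the (possibly reused) buffer, then look them up."""
    buf = reserve(len(ids))
    for i, dict_id in enumerate(ids):
        buf.put(i, dict_id)
    return [dictionary[buf.data[i]] for i in range(len(ids))]


# Two made-up row groups: (dictionary, dictionary ids). The second is larger.
ROW_GROUPS = [
    (["a", "b"], [0, 1]),
    (["x", "y", "z"], [2, 0, 1, 2]),
]


def read_all(reserve):
    return [decode(reserve, dictionary, ids) for dictionary, ids in ROW_GROUPS]


if __name__ == "__main__":
    print(read_all(ToyColumnVector().reserve_ids_safe))
    try:
        read_all(ToyColumnVector().reserve_ids_buggy)
    except IndexError as exc:
        print("buggy reuse failed on the second row group:", exc)
```

In this sketch the safe variant still reuses the buffer; it only makes sure the buffer fits the current row group before writing into it.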
    
    Manually tested on datasets provided by the community. Thanks to Chris
    Perluss and Keith Kraus for their invaluable help in tracking down this
    issue!
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #14941 from sameeragarwal/parquet-exception-2.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14944
    
----
commit facf221006295f8984498de20c66da78cb0d42fa
Author: Sameer Agarwal <[email protected]>
Date:   2016-09-02T22:16:16Z

    [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error

----

