[GitHub] spark pull request: [SPARK-6010] [SQL] Merging compatible Parquet ...

liancheng Wed, 25 Feb 2015 09:35:11 -0800

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/4768


    [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits

    `ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which 
doesn't know how to merge conflicting user-defined key-value metadata and 
throws an exception. In our case, when dealing with different but compatible 
schemas, different Parquet part-files carry different Spark SQL schema JSON 
strings, which causes this problem. Reading similar Parquet files generated 
by Hive doesn't suffer from this issue.
    
    In this PR, we manually merge the schemas before passing them to 
`ReadContext` to avoid the exception.
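
The failure mode and the workaround described above can be illustrated with a small, self-contained Python sketch. It is not the code in this PR and does not use the Parquet or Spark APIs: the metadata key `schema`, the `{field: type}` schema representation, and every function name below are hypothetical stand-ins chosen for illustration. It only assumes that a strict key-value merge rejects one key with two different values, and that compatible schemas can be unioned field by field.

```python
import json

# Hypothetical metadata key under which each part-file stores its schema JSON.
SCHEMA_KEY = "schema"


def strict_merge_metadata(footers):
    """Strict merge: the same key with two different values is an error."""
    merged = {}
    for meta in footers:
        for key, value in meta.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for metadata key {key!r}")
            merged[key] = value
    return merged


def merge_schemas(left, right):
    """Union two compatible schemas given as {field name: type name} dicts."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise TypeError(f"incompatible types for field {name!r}")
        merged[name] = dtype
    return merged


def merged_footer_metadata(footers):
    """Merge the schemas first, then run the strict metadata merge."""
    schemas = [json.loads(meta[SCHEMA_KEY]) for meta in footers]
    unified = schemas[0]
    for schema in schemas[1:]:
        unified = merge_schemas(unified, schema)
    unified_json = json.dumps(unified, sort_keys=True)
    # Every footer now carries the same schema string, so nothing conflicts.
    rewritten = [{**meta, SCHEMA_KEY: unified_json} for meta in footers]
    return strict_merge_metadata(rewritten)


# Two part-files with different but compatible schemas.
part1 = {SCHEMA_KEY: json.dumps({"a": "int"})}
part2 = {SCHEMA_KEY: json.dumps({"a": "int", "b": "string"})}
```

With these two footers, `strict_merge_metadata([part1, part2])` raises `ValueError` because the schema strings differ, while `merged_footer_metadata([part1, part2])` succeeds and returns a single schema containing both fields `a` and `b`.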

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-6010

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4768.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4768
    
----
commit 9002f0a3b631474f8bff23c70a854064fe3e3fda
Author: Cheng Lian <[email protected]>
Date:   2015-02-25T17:20:35Z

    Fixes SPARK-6010

----


