GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/4768
[SPARK-6010] [SQL] Merging compatible Parquet schemas before computing
splits
`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which
doesn't know how to merge conflicting user-defined key-value metadata and
throws an exception. In our case, when the part-files have different but
compatible schemas, each Parquet part-file stores a different Spark SQL
schema JSON string in its metadata, which triggers the exception. Reading
similar Parquet files generated by Hive doesn't suffer from this issue.
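The failure described above can be illustrated with a toy model. This is a minimal sketch, not the parquet-mr implementation: the function `merge_key_value_metadata` and the key `spark.schema.json` are hypothetical names chosen for illustration. The only behavior assumed is that a key with more than one distinct value across part-files cannot be merged.

```python
def merge_key_value_metadata(footers):
    """Merge per-file key-value metadata (toy model, not parquet-mr code).

    `footers` is a list of dicts, one per part-file. A key that maps to
    more than one distinct value across files is treated as a conflict.
    """
    seen = {}
    for metadata in footers:
        for key, value in metadata.items():
            seen.setdefault(key, set()).add(value)

    merged = {}
    for key, values in seen.items():
        if len(values) > 1:
            raise RuntimeError(
                "could not merge metadata: key %s has conflicting values" % key
            )
        merged[key] = next(iter(values))
    return merged


# Two part-files with compatible schemas but different schema JSON strings.
# "spark.schema.json" is a placeholder key, not the real metadata key.
part_1 = {"spark.schema.json": '[{"name": "id", "type": "integer"}]'}
part_2 = {
    "spark.schema.json":
        '[{"name": "id", "type": "integer"}, {"name": "name", "type": "string"}]'
}
```

Calling `merge_key_value_metadata([part_1, part_2])` raises in this model, even though the second schema only adds a column to the first.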
In this PR, we manually merge the schemas before passing them to
`ReadContext` to avoid the exception.
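The idea of merging schemas up front can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: a schema is modeled as a JSON list of `{"name", "type"}` fields, and `merge_schemas` and `merged_schema_json` are hypothetical helpers. Two schemas are treated as compatible when every shared field name has the same type.

```python
import json


def merge_schemas(left, right):
    """Union two field lists; fail if a shared field has different types."""
    merged = list(left)
    types = dict(left)
    for name, dtype in right:
        if name not in types:
            merged.append((name, dtype))
            types[name] = dtype
        elif types[name] != dtype:
            raise ValueError(
                "incompatible types for field %s: %s vs %s"
                % (name, types[name], dtype)
            )
    return merged


def merged_schema_json(schema_jsons):
    """Fold all per-file schema JSON strings into one merged JSON string."""
    merged = []
    for text in schema_jsons:
        fields = [(f["name"], f["type"]) for f in json.loads(text)]
        merged = merge_schemas(merged, fields)
    return json.dumps([{"name": n, "type": t} for n, t in merged])
```

With a single merged string standing in for the per-file values, there is only one value per key left, so the conflict that caused the exception no longer arises.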
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-6010
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4768.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4768
----
commit 9002f0a3b631474f8bff23c70a854064fe3e3fda
Author: Cheng Lian <[email protected]>
Date: 2015-02-25T17:20:35Z
Fixes SPARK-6010
----