GitHub user yuchenhuo opened a pull request:
https://github.com/apache/spark/pull/20953
[SPARK-23822][SQL] Improve error message for Parquet schema mismatches
## What changes were proposed in this pull request?
This pull request improves the error message Spark reports when reading
Parquet files with different schemas, e.g. one with a STRING column and
the other with an INT column. A new
ParquetSchemaColumnConvertNotSupportedException is added to replace the old
UnsupportedOperationException. The exception is then wrapped in
FileScanRDD.scala, which throws a more general QueryExecutionException that
includes the name of the Parquet file that triggered the exception.
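The wrapping pattern described above can be sketched in a few lines of plain Java. This is a minimal illustration, not the code in the patch: the two exception classes are local stand-ins that only borrow the names used in this description, and their fields, the message format, and the `readColumn` and `scanFile` helpers are assumptions made for the example.

```java
public class ParquetMismatchSketch {

    // Hypothetical stand-in for the new exception; the fields are assumptions.
    static class ParquetSchemaColumnConvertNotSupportedException extends RuntimeException {
        final String column;
        final String physicalType;
        final String logicalType;

        ParquetSchemaColumnConvertNotSupportedException(
                String column, String physicalType, String logicalType) {
            super("column: " + column
                + ", physicalType: " + physicalType
                + ", logicalType: " + logicalType);
            this.column = column;
            this.physicalType = physicalType;
            this.logicalType = logicalType;
        }
    }

    // Hypothetical stand-in for the more general outer exception.
    static class QueryExecutionException extends RuntimeException {
        QueryExecutionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Low-level read (hypothetical helper): knows the column and its types,
    // but not which file it is reading.
    static void readColumn(String column, String physicalType, String logicalType) {
        if (!physicalType.equals(logicalType)) {
            throw new ParquetSchemaColumnConvertNotSupportedException(
                column, physicalType, logicalType);
        }
    }

    // File-level scan (hypothetical helper): knows the file path, so it wraps
    // the low-level exception and adds the file name to the message.
    static void scanFile(String path, String column,
                         String physicalType, String logicalType) {
        try {
            readColumn(column, physicalType, logicalType);
        } catch (ParquetSchemaColumnConvertNotSupportedException e) {
            throw new QueryExecutionException(
                "Encountered error while reading file " + path
                    + ". Details: " + e.getMessage(),
                e);
        }
    }

    public static void main(String[] args) {
        try {
            // Made-up path and types, for illustration only.
            scanFile("/tmp/part-00000.parquet", "id", "BINARY", "INT32");
        } catch (QueryExecutionException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause means the column and type details stay available to callers, while the outer message carries the file name that the low-level reader does not know.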
## How was this patch tested?
Unit tests were added to check the new exception and verify the error messages.
The change was also tested manually with two Parquet files with different
schemas to check the error message.
<img width="1125" alt="screen shot 2018-03-30 at 4 03 04 pm"
src="https://user-images.githubusercontent.com/37087310/38156580-dd58a140-3433-11e8-973a-b816d859fbe1.png">
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yuchenhuo/OSS SPARK-23822
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20953.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20953
----
commit 375e8991f15b61b6c595c2da9ae345e61791a7f7
Author: Yuchen Huo <yuchen.huo@...>
Date: 2018-03-29T16:58:34Z
Wrap parquet mismatch error message
commit ab55256efb1a23831a53a9c67475ff653dc91e96
Author: Yuchen Huo <yuchen.huo@...>
Date: 2018-03-30T22:39:03Z
fix wrong order of the physical and logical schema
----