GitHub user budde opened a pull request:
https://github.com/apache/spark/pull/16797
[SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution
## What changes were proposed in this pull request?
**Summary**
- Add spark.sql.parquet.caseInsensitiveResolution config option
- Add caseInsensitive option to ParquetReadSupport.clipParquetType
- Add ParquetIOSuite test
- Disable Parquet filter push-down when using case-insensitive field
resolution
**Details**
[*Copied from
SPARK-19455*](https://issues.apache.org/jira/browse/SPARK-19455)
[SPARK-16980](https://issues.apache.org/jira/browse/SPARK-16980) removed
the schema inference from the HiveMetastoreCatalog class when converting a
MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
favor of simply using the schema returned by the metastore. This results in an
optimization as the underlying file statuses no longer need to be resolved until
after the partition pruning step, reducing the number of files to be touched
significantly in some cases. The downside is that the data schema used may no
longer match the underlying file schema for case-sensitive formats such as
Parquet.
This change initially included a [patch to
ParquetReadSupport](https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284)
that attempted to remedy this conflict by using a case-insensitive fallback
mapping when resolving field names during the schema clipping step.
[SPARK-18333](https://issues.apache.org/jira/browse/SPARK-18333) later removed
this patch after
[SPARK-17183](https://issues.apache.org/jira/browse/SPARK-17183) added support
for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK
the assumption here was that the data schema obtained from the Metastore table
property will be case sensitive and should match the Parquet schema exactly.
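To make the fallback idea concrete, here is a minimal Python sketch of resolving requested field names against a file schema, trying an exact match first and only then a lower-cased lookup. It is not the Scala code from ParquetReadSupport; the function name `clip_fields`, the flat list-of-names schema, and the "first field wins" rule for case collisions are all assumptions made for illustration.

```python
def clip_fields(file_fields, requested_fields, case_insensitive=False):
    """Map each requested name to the matching file field name, or None.

    Hypothetical helper: schemas are modeled as flat lists of field names.
    """
    exact = set(file_fields)

    # Lower-cased lookup table; assumption: the first field wins when two
    # file fields differ only by case.
    lowered = {}
    for name in file_fields:
        lowered.setdefault(name.lower(), name)

    resolved = {}
    for name in requested_fields:
        match = name if name in exact else None
        if match is None and case_insensitive:
            match = lowered.get(name.lower())  # case-insensitive fallback
        resolved[name] = match
    return resolved


file_fields = ["userId", "eventTime"]   # names as written in the Parquet file
requested = ["userid", "eventtime"]     # lower-cased names from the metastore

print(clip_fields(file_fields, requested))
# {'userid': None, 'eventtime': None}
print(clip_fields(file_fields, requested, case_insensitive=True))
# {'userid': 'userId', 'eventtime': 'eventTime'}
```

With the fallback off, neither requested field resolves, which corresponds to the missing-data symptom described in this PR; with it on, both map back to the mixed-case names in the file.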
The problem arises when dealing with Parquet-backed tables for which this
schema has not been embedded as a table attribute and for which the underlying
files contain case-sensitive field names. This will happen for any Hive table
that was not created by Spark or that was created by a Spark version prior to
2.1.0. We've seen Spark SQL return no results for any query containing a
case-sensitive field name for such tables.
The change we're proposing is to introduce a configuration parameter that
will re-enable case-insensitive field name resolution in ParquetReadSupport.
This option will also disable filter push-down for Parquet, as the filter
predicate constructed by Spark SQL contains the case-insensitive field names,
for which Parquet will return 0 records when filtering against a case-sensitive
column name. I was hoping to find a way to construct the filter on-the-fly in
ParquetReadSupport, but Parquet doesn't propagate the Configuration object
passed to this class to the underlying InternalParquetRecordReader class.
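The push-down issue can be illustrated with a small Python sketch. It only models the behavior described above and does not call Parquet: `pushed_down_filter` is a hypothetical stand-in for a reader-side filter that matches column names exactly, and `filter_after_resolution` is a hypothetical stand-in for filtering on the Spark side once names have been resolved case-insensitively.

```python
rows = [{"userId": 1}, {"userId": 2}]  # records keyed by the file's field names


def pushed_down_filter(rows, column, value):
    # Stand-in for a reader-side filter: the column name must match exactly,
    # so a lower-cased name never matches a mixed-case field.
    return [r for r in rows if column in r and r[column] == value]


def filter_after_resolution(rows, column, value):
    # Stand-in for filtering after case-insensitive field resolution.
    matched = []
    for r in rows:
        by_lower = {k.lower(): v for k, v in r.items()}
        if by_lower.get(column.lower()) == value:
            matched.append(r)
    return matched


print(pushed_down_filter(rows, "userid", 1))       # []
print(filter_after_resolution(rows, "userid", 1))  # [{'userId': 1}]
```

The exact-match filter silently drops every record, which is why the option turns push-down off rather than passing the lower-cased predicate through.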
## How was this patch tested?
This patch re-introduces a unit test to ParquetSchemaSuite.scala to test
that case-insensitive schema clipping behaves as expected. It also introduces a
ParquetIOSuite unit test that constructs a case-insensitive catalog table and
ensures case-sensitive Parquet data can still be queried.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/budde/spark SPARK-19455
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16797.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16797
----
commit 5426271946419a9defb59bb84575501bc8296578
Author: Budde <[email protected]>
Date: 2017-02-02T07:34:15Z
[SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution
- Add spark.sql.parquet.caseInsensitiveResolution config option
- Add caseInsensitive option to ParquetReadSupport.clipParquetType
- Add ParquetIOSuite test
- Disable Parquet filter push-down when using case-insensitive field
resolution
----