[
https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Budde closed SPARK-19455.
------------------------------
Resolution: Duplicate
Closing in favor of https://issues.apache.org/jira/browse/SPARK-19611
> Add option for case-insensitive Parquet field resolution
> --------------------------------------------------------
>
> Key: SPARK-19455
> URL: https://issues.apache.org/jira/browse/SPARK-19455
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
> schema inference from the HiveMetastoreCatalog class when converting a
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
> favor of simply using the schema returned by the metastore. This is an
> optimization, as the underlying file statuses no longer need to be resolved
> until after the partition pruning step, significantly reducing the number of
> files to be touched in some cases. The downside is that the data schema
> used may no longer match the underlying file schema for case-sensitive
> formats such as Parquet.
> This change initially included a [patch to
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback
> mapping when resolving field names during the schema clipping step.
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed
> this patch after
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support
> for embedding a case-sensitive schema as a Hive Metastore table property.
> AFAIK the assumption here was that the data schema obtained from the
> Metastore table property will be case sensitive and should match the Parquet
> schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this
> schema has not been embedded as a table attribute and for which the
> underlying files contain case-sensitive field names. This will happen for any
> Hive table that was not created by Spark or was created by a Spark version
> prior to 2.1.0. We've seen Spark SQL return no results for any query
> containing a case-sensitive field name for such tables.
> The change we're proposing is to introduce a configuration parameter that
> will re-enable case-insensitive field name resolution in ParquetReadSupport.
> This option will also disable filter push-down for Parquet: the filter
> predicate constructed by Spark SQL contains the case-insensitive field names,
> so Parquet would return 0 records when filtering them against a
> case-sensitive column name. I was hoping to find a way to construct the
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the
> Configuration object passed to this class to the underlying
> InternalParquetRecordReader class.
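For illustration only, here is a minimal Scala sketch of the kind of case-insensitive fallback lookup the description refers to. The object and method names are hypothetical (not the actual ParquetReadSupport patch); it assumes resolution should fail when two Parquet field names collide under lowercasing, since the match would be ambiguous.

```scala
// Hypothetical sketch of case-insensitive field resolution with an
// exact-match-first policy, as described in the issue above.
object CaseInsensitiveResolution {
  // Build a lowercase-keyed map of Parquet field names. Names that collide
  // once lowercased are dropped: resolving against them would be ambiguous.
  def fallbackMap(parquetFields: Seq[String]): Map[String, String] =
    parquetFields
      .groupBy(_.toLowerCase)
      .collect { case (lower, Seq(original)) => lower -> original }

  // Resolve a catalog field name: try an exact (case-sensitive) match first,
  // then fall back to the case-insensitive mapping.
  def resolve(catalogName: String, parquetFields: Seq[String]): Option[String] =
    parquetFields.find(_ == catalogName)
      .orElse(fallbackMap(parquetFields).get(catalogName.toLowerCase))
}
```

Under this sketch, a metastore-supplied lowercase name like `userid` would resolve to a Parquet field `UserId`, while a name that is ambiguous under lowercasing (e.g. when both `Name` and `NAME` exist in the file) would fail to resolve rather than pick one arbitrarily.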
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]