[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904033#comment-15904033 ]
Apache Spark commented on SPARK-19611:
--------------------------------------
User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/17229
> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> -----------------------------------------------------------------------
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Adam Budde
> Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR
> #16797|https://github.com/apache/spark/pull/16797].
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
> schema inference from the HiveMetastoreCatalog class when converting a
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
> favor of simply using the schema returned by the metastore. This is an
> optimization, as the underlying file statuses no longer need to be resolved
> until after the partition pruning step, which in some cases significantly
> reduces the number of files touched. The downside is that the data schema
> used may no longer match the underlying file schema for case-sensitive
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support
> for saving a case-sensitive copy of the schema in the metastore table
> properties, which HiveExternalCatalog will read in as the table's schema if
> it is present. If it is not present, it will fall back to the
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying
> data fields are case-sensitive but a case-sensitive schema wasn't written to
> the table properties by Spark. This situation will occur for any Hive table
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a
> user runs a query over such a table and the query references a
> case-sensitive field name in its projection or filter, the query will
> return 0 results in every case.
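
To make the failure mode above concrete, here is a minimal pure-Python sketch. It is not Spark code: the row data, the field names (userId, eventName) and the helper functions are all hypothetical, and it assumes, as a simplification, that a field name absent from the file is read as null.

```python
# Hypothetical stand-in for a Parquet file: field names keep their original case.
FILE_ROWS = [
    {"userId": 1, "eventName": "click"},
    {"userId": 2, "eventName": "view"},
]


def read_column(row, field_name):
    """Case-sensitive lookup; a name the file does not contain reads as None."""
    return row.get(field_name)


def scan(rows, schema, predicate_field, predicate_value):
    """Project the schema's fields, then keep rows matching an equality filter."""
    projected = [{name: read_column(row, name) for name in schema} for row in rows]
    return [r for r in projected if r[predicate_field] == predicate_value]


# The metastore hands back lower-cased names; the file uses mixed case.
metastore_schema = ["userid", "eventname"]
file_schema = ["userId", "eventName"]

# Every projected value is None, so the filter matches nothing.
lowercase_result = scan(FILE_ROWS, metastore_schema, "userid", 1)

# With the case-preserving schema the same filter finds the row.
exact_case_result = scan(FILE_ROWS, file_schema, "userId", 1)
```

In this sketch the lower-cased schema yields an empty result without raising any error, which is the silent breakage the description refers to.
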
> The change we are proposing is to bring back the schema inference that was
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the
> table properties. Three inference modes are proposed:
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive
> schema can be read from the table properties. Attempt to save the inferred
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the
> Hive Metastore. Useful if the user knows that none of the underlying data is
> case-sensitive.
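
The three modes listed above could be sketched roughly as follows. This is an illustration of the decision logic only, not the code in the pull request: resolve_schema, SCHEMA_KEY and the infer_from_files callback are hypothetical names, and the "attempt to save" step is modeled as a plain dictionary update that cannot fail.

```python
from enum import Enum

# Hypothetical table-property key holding the case-sensitive schema.
SCHEMA_KEY = "example.case.sensitive.schema"


class InferenceMode(Enum):
    INFER_AND_SAVE = "INFER_AND_SAVE"
    INFER_ONLY = "INFER_ONLY"
    NEVER_INFER = "NEVER_INFER"


def resolve_schema(mode, table_properties, metastore_schema, infer_from_files):
    """Return (schema, table_properties) for the given inference mode."""
    saved = table_properties.get(SCHEMA_KEY)
    if saved is not None:
        # A case-sensitive schema is already stored: no inference needed.
        return saved, table_properties
    if mode is InferenceMode.NEVER_INFER:
        # Fall back to the case-insensitive metastore schema.
        return metastore_schema, table_properties
    # Both remaining modes infer the schema from the data files.
    inferred = infer_from_files()
    if mode is InferenceMode.INFER_AND_SAVE:
        # Store the result so later calls can skip inference.
        table_properties = {**table_properties, SCHEMA_KEY: inferred}
    return inferred, table_properties
```

Called twice with INFER_AND_SAVE, passing the returned properties back in, the sketch runs inference only on the first call. With INFER_ONLY every call repeats the inference.
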
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more
> discussion of this issue and the proposed solution.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)