Adam Budde created SPARK-19611:
----------------------------------
Summary: Spark 2.1.0 breaks some Hive tables backed by
case-sensitive data files
Key: SPARK-19611
URL: https://issues.apache.org/jira/browse/SPARK-19611
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Adam Budde
This issue replaces
[SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR
#16797|https://github.com/apache/spark/pull/16797].
[SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
schema inference from the HiveMetastoreCatalog class when converting a
MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
favor of simply using the schema returned by the metastore. This results in an
optimization as the underlying file statuses no longer need to be resolved until
after the partition pruning step, reducing the number of files to be touched
significantly in some cases. The downside is that the data schema used may no
longer match the underlying file schema for case-sensitive formats such as
Parquet.
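To make the trade-off concrete, here is a minimal Python sketch of the file-listing cost. It is not Spark code: the partition layout and both helper functions are hypothetical, and it simplifies inference to "every data file must be listed first".

```python
# Hypothetical table layout: partition directory -> data files it contains.
PARTITIONS = {
    "ds=2017-01-01": ["part-0.parquet", "part-1.parquet"],
    "ds=2017-01-02": ["part-0.parquet", "part-1.parquet"],
    "ds=2017-01-03": ["part-0.parquet"],
}

def files_listed_with_inference(partitions):
    """Inferring the schema from data files requires listing every partition."""
    return sum(len(files) for files in partitions.values())

def files_listed_after_pruning(partitions, wanted):
    """With the metastore schema, listing can wait until pruning has run."""
    return sum(len(files) for part, files in partitions.items() if part in wanted)

eager = files_listed_with_inference(PARTITIONS)
pruned = files_listed_after_pruning(PARTITIONS, {"ds=2017-01-03"})
print(eager, pruned)
```

In this toy layout a query restricted to a single partition lists one file instead of five, which is the saving described above.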
Unfortunately, this silently breaks queries over tables where the underlying
data fields are case-sensitive but a case-sensitive schema wasn't written to
the table properties by Spark. This situation will occur for any Hive table
that wasn't created by Spark or that was created prior to Spark 2.1.0. If a
user attempts to run a query over such a table containing a case-sensitive
field name in the query projection or in the query filter, the query will
return 0 results in every case.
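The failure mode can be sketched in plain Python as follows. This is an illustration only, not Spark's reader: the field names, the rows, and the scan helper are hypothetical, and the assumption is that a field missing from the file under a case-sensitive lookup reads as NULL.

```python
# Schema as returned by the Hive metastore: field names are lowercased.
metastore_schema = ["userid", "eventname"]

# Rows as stored in a case-sensitive file format such as Parquet.
file_rows = [
    {"userId": 1, "eventName": "click"},
    {"userId": 2, "eventName": "view"},
]

def scan(rows, schema, predicate_field, predicate_value):
    """Project each row with a case-sensitive lookup, then apply the filter.

    A field that is not found in the file reads as None (NULL).
    """
    out = []
    for row in rows:
        projected = {name: row.get(name) for name in schema}
        if projected.get(predicate_field) == predicate_value:
            out.append(projected)
    return out

# Lowercased metastore names never match the mixed-case file fields.
broken = scan(file_rows, metastore_schema, "eventname", "click")
# A case-sensitive schema matches the file and the filter behaves as expected.
working = scan(file_rows, ["userId", "eventName"], "eventName", "click")
print(len(broken), len(working))
```

With the lowercased schema every looked-up value is NULL, so the filter rejects every row and the query returns nothing without raising an error.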
The change we are proposing is to bring back the schema inference that was used
prior to Spark 2.1.0 if a case-sensitive schema can't be read from the table
properties. The proposed inference modes are:
- INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive
schema can be read from the table properties. Attempt to save the inferred
schema in the table properties to avoid future inference.
- INFER_ONLY: Infer the schema if no case-sensitive schema can be read but
don't attempt to save it.
- NEVER_INFER: Fall back to using the case-insensitive schema returned by the
Hive Metastore. Useful if the user knows that none of the underlying data is
case-sensitive.
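The decision logic behind the three modes might look roughly like the Python sketch below. It is a hedged illustration, not the proposed Spark implementation: resolve_schema, its parameters, and the "case_sensitive_schema" property key are all hypothetical, and the save step here always succeeds, whereas the proposal only promises an attempt.

```python
from enum import Enum

class InferenceMode(Enum):
    INFER_AND_SAVE = "INFER_AND_SAVE"
    INFER_ONLY = "INFER_ONLY"
    NEVER_INFER = "NEVER_INFER"

def resolve_schema(mode, table_props, metastore_schema, infer_from_files):
    """Return (schema, source) for a table.

    table_props: mutable dict standing in for the table properties.
    infer_from_files: zero-argument callable that reads the data files.
    """
    saved = table_props.get("case_sensitive_schema")  # hypothetical key
    if saved is not None:
        # A case-sensitive schema is already stored; no inference needed.
        return saved, "table properties"
    if mode is InferenceMode.NEVER_INFER:
        # Fall back to the case-insensitive metastore schema.
        return metastore_schema, "metastore"
    inferred = infer_from_files()
    if mode is InferenceMode.INFER_AND_SAVE:
        # Persist the result so later reads can skip inference.
        table_props["case_sensitive_schema"] = inferred
    return inferred, "inferred"

props = {}
lowercased = ["userid", "eventname"]
infer = lambda: ["userId", "eventName"]

first = resolve_schema(InferenceMode.INFER_AND_SAVE, props, lowercased, infer)
second = resolve_schema(InferenceMode.INFER_AND_SAVE, props, lowercased, infer)
never = resolve_schema(InferenceMode.NEVER_INFER, {}, lowercased, infer)
print(first[1], second[1], never[1])
```

The first call infers and stores the schema, the second finds it in the table properties, and NEVER_INFER skips the data files entirely.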
See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797]
for more background on this issue and the proposed solution.