[
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Budde updated SPARK-19611:
-------------------------------
Description:
This issue supersedes
[SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR
#16797|https://github.com/apache/spark/pull/16797].
[SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
schema inference from the HiveMetastoreCatalog class when converting a
MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
favor of simply using the schema returned by the metastore. This is an
optimization, as the underlying file statuses no longer need to be resolved
until after the partition pruning step, significantly reducing the number of
files to be touched in some cases. The downside is that the data schema used
may no longer match the underlying file schema for case-sensitive formats such
as Parquet.
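The mismatch can be illustrated with a toy model (plain Python, not Spark internals; the field names are made up). Parquet preserves the case of field names as written, while the Hive metastore lower-cases column names, so a case-sensitive lookup of the metastore names against the file schema matches nothing:

```python
# Toy model of the schema mismatch (not actual Spark code).
# Parquet preserves the case of field names written by the producer:
file_schema = ["eventId", "userName"]

# The Hive metastore stores column names lower-cased:
metastore_schema = ["eventid", "username"]

# A case-sensitive match of the metastore schema against the file
# schema finds no columns, so every projection/filter returns 0 rows:
matched = [col for col in metastore_schema if col in file_schema]
print(matched)  # []

# A case-insensitive match (what schema inference restores) succeeds:
lowered = {col.lower(): col for col in file_schema}
resolved = [lowered[col] for col in metastore_schema if col in lowered]
print(resolved)  # ['eventId', 'userName']
```

This is why the failure is silent: no columns match, so queries succeed but return empty results rather than erroring.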
[SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support
for saving a case-sensitive copy of the schema in the metastore table
properties, which HiveExternalCatalog will read in as the table's schema if it
is present. If it is not present, it will fall back to the case-insensitive
metastore schema.
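Since Hive table properties have a bounded value length, the case-sensitive schema is saved as JSON split across numbered property entries and reassembled on read. A sketch of that lookup and fallback, assuming property keys following the `spark.sql.sources.schema.*` convention (this is an illustration, not the actual HiveExternalCatalog code):

```python
import json

def read_schema_from_props(props):
    """Reassemble a case-sensitive schema saved in table properties,
    returning None when no such schema is present (the fallback case)."""
    num_parts = props.get("spark.sql.sources.schema.numParts")
    if num_parts is None:
        return None  # fall back to the case-insensitive metastore schema
    parts = [props[f"spark.sql.sources.schema.part.{i}"]
             for i in range(int(num_parts))]
    return json.loads("".join(parts))

# Example: a schema JSON string split across two property entries.
props = {
    "spark.sql.sources.schema.numParts": "2",
    "spark.sql.sources.schema.part.0": '{"type": "struct", "fields": [{"na',
    "spark.sql.sources.schema.part.1": 'me": "eventId", "type": "string"}]}',
}
schema = read_schema_from_props(props)
print(schema["fields"][0]["name"])  # eventId
```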
Unfortunately, this silently breaks queries over tables where the underlying
data fields are case-sensitive but a case-sensitive schema wasn't written to
the table properties by Spark. This situation will occur for any Hive table
that wasn't created by Spark or that was created prior to Spark 2.1.0. If a
user attempts to run a query over such a table containing a case-sensitive
field name in the query projection or in the query filter, the query will
return 0 results in every case.
The change we are proposing is to bring back the schema inference that was used
prior to Spark 2.1.0 if a case-sensitive schema can't be read from the table
properties.
- INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive
schema can be read from the table properties. Attempt to save the inferred
schema in the table properties to avoid future inference.
- INFER_ONLY: Infer the schema if no case-sensitive schema can be read but
don't attempt to save it.
- NEVER_INFER: Fall back to using the case-insensitive schema returned by the
Hive Metastore. Useful if the user knows that none of the underlying data is
case-sensitive.
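The proposed fallback logic for the three modes above can be sketched as follows (the function name and callback parameters are hypothetical, not the actual HiveMetastoreCatalog API):

```python
from enum import Enum

class InferenceMode(Enum):
    INFER_AND_SAVE = "INFER_AND_SAVE"
    INFER_ONLY = "INFER_ONLY"
    NEVER_INFER = "NEVER_INFER"

def resolve_table_schema(mode, case_sensitive_schema, metastore_schema,
                         infer_from_files, save_to_props):
    """Choose a table schema according to the proposed inference modes."""
    # A case-sensitive schema saved in the table properties always wins.
    if case_sensitive_schema is not None:
        return case_sensitive_schema
    if mode is InferenceMode.NEVER_INFER:
        # Trust the (lower-cased) metastore schema as-is.
        return metastore_schema
    # Infer from the underlying data files (the pre-2.1.0 behavior).
    inferred = infer_from_files()
    if mode is InferenceMode.INFER_AND_SAVE:
        save_to_props(inferred)  # avoid re-inferring on future queries
    return inferred

# Example: no saved schema, so INFER_AND_SAVE infers and persists it.
saved = []
schema = resolve_table_schema(
    InferenceMode.INFER_AND_SAVE,
    case_sensitive_schema=None,
    metastore_schema=["eventid"],
    infer_from_files=lambda: ["eventId"],
    save_to_props=saved.append,
)
print(schema, saved)  # ['eventId'] [['eventId']]
```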
See [PR #16797|https://github.com/apache/spark/pull/16797] for further
discussion of this issue and the proposed solution.
> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> -----------------------------------------------------------------------
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Adam Budde
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)