[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

cloud-fan Wed, 28 Sep 2016 04:41:57 -0700

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14690
  
    > This PR removes and omits partitioned Hive table schema reconciliation. 
As a result, it fails to find Parquet schema columns with upper case letters 
because of the Hive metastore's case-insensitivity.
    
    After an offline discussion with the team, we think there is a simple fix 
for it that avoids making `HiveExternalCatalog` case-preserving via hacks.
    
    If we have Parquet files whose schema has uppercase field names, and create 
a Hive external table over them, Hive will fail to read them and always return 
null for every column, because the actual Parquet file schema is different from 
the table schema stored in the metastore (Hive lowercases everything).
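    
    To illustrate the mismatch, here is a minimal Python sketch. It is not Hive 
code: the field names `userId` and `eventTime` and the helper 
`read_case_sensitive` are made up for illustration. It only shows what happens 
when lowercased table columns are resolved by exact name against a file row 
that kept its original casing:

```python
# Hypothetical field names as written in the Parquet file (illustration only).
parquet_row = {"userId": 42, "eventTime": "2016-09-28"}

# The metastore keeps the table schema in lowercase.
metastore_columns = [name.lower() for name in parquet_row]


def read_case_sensitive(row, columns):
    """Resolve each table column by exact name; a miss yields None (null)."""
    return {col: row.get(col) for col in columns}


hive_view = read_case_sensitive(parquet_row, metastore_columns)
print(hive_view)  # {'userid': None, 'eventtime': None}
```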
    
    However, Spark SQL can still read it if the Parquet conversion is enabled, 
because we list all files and infer the schema before reading. This causes 
some problems:
    
    1. Enabling or disabling an optimization flag should not affect the query 
result, but the Parquet conversion flag does.
    2. On the first read, we list all files, infer the real schema from them, 
and cache this schema for subsequent reads. This means the real schema of the 
table differs from the one the table reports to users, which is dangerous. 
Users may run LOAD DATA to append new data to the table according to the schema 
returned by DESC TABLE, but then be unable to query it.
    3. It causes the problem this PR runs into.
    
    I think we can just remove the table schema reconciliation when converting 
a Hive Parquet table to a data source table, and fix the failing tests.
    
    But we also need to make sure 
`ParquetFileFormat.mergeMetastoreParquetSchema` reconciles only upper/lower 
case differences.
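    
    As a rough sketch of what "case-only" reconciliation could look like (in 
Python, and not the actual `ParquetFileFormat.mergeMetastoreParquetSchema` 
implementation): the function name `reconcile_case` and the field names are 
hypothetical, and both the decision to leave unmatched metastore fields 
untouched and the decision to reject names that differ only by case are 
assumptions made for this example.

```python
def reconcile_case(metastore_fields, parquet_fields):
    """Return the metastore field names with the Parquet casing restored.

    Only case differences are reconciled: a metastore field with no
    case-insensitive match in the Parquet schema is kept as-is, and Parquet
    fields that are absent from the metastore schema are not added.
    """
    by_lower = {}
    for name in parquet_fields:
        key = name.lower()
        if key in by_lower:
            # Assumption: two fields differing only by case cannot be resolved.
            raise ValueError(
                f"fields differ only by case: {by_lower[key]!r}, {name!r}"
            )
        by_lower[key] = name
    return [by_lower.get(name.lower(), name) for name in metastore_fields]


restored = reconcile_case(
    ["userid", "eventtime", "ds"],       # lowercased metastore schema
    ["userId", "eventTime", "extra"],    # schema read from the Parquet files
)
print(restored)  # ['userId', 'eventTime', 'ds']
```

    Here `ds` has no counterpart in the file schema and is passed through 
unchanged, while `extra` is ignored because the metastore schema does not 
declare it.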
    
    cc @yhuai @liancheng


