Adam Budde created SPARK-19455:
----------------------------------

             Summary: Add option for case-insensitive Parquet field resolution
                 Key: SPARK-19455
                 URL: https://issues.apache.org/jira/browse/SPARK-19455
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Adam Budde


[SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
schema inference from the HiveMetastoreCatalog class when converting a 
MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
favor of simply using the schema returned by the metastore. This is an 
optimization: the underlying file statuses no longer need to be resolved until 
after the partition pruning step, which can significantly reduce the number of 
files touched in some cases. The downside is that the data schema used may no 
longer match the underlying file schema for case-sensitive formats such as 
Parquet.
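
For illustration, here is a minimal sketch of the kind of mismatch this creates (field names are made up; the metastore lower-cases column names while the Parquet footers preserve the original casing):

{code:scala}
import org.apache.spark.sql.types._

// Schema as returned by the Hive metastore (field names are all lower case)
val metastoreSchema = StructType(Seq(
  StructField("eventid", LongType),
  StructField("userid", StringType)
))

// Schema actually written into the Parquet file footers (case preserved)
val parquetFileSchema = StructType(Seq(
  StructField("eventId", LongType),
  StructField("userId", StringType)
))

// With SPARK-16980 the metastore schema is used as the data schema directly,
// so "eventid" no longer resolves against the Parquet field "eventId".
{code}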

This change initially included a [patch to 
ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
 that attempted to remedy this conflict by using a case-insensitive fallback 
mapping when resolving field names during the schema clipping step. 
[SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed 
this patch after 
[SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK 
the assumption here was that the data schema obtained from the Metastore table 
property will be case-sensitive and should match the Parquet schema exactly.
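
A rough sketch of the fallback lookup the removed patch performed (the helper below is illustrative only, not the actual ParquetReadSupport code): try an exact match first, then fall back to a lower-cased mapping of the Parquet field names.

{code:scala}
import java.util.Locale
import scala.collection.JavaConverters._
import org.apache.parquet.schema.{GroupType, Type}

// Resolve a Catalyst field name against a Parquet group, falling back to a
// case-insensitive match when no exact match exists.
def resolveParquetField(parquetRecord: GroupType, catalystName: String): Option[Type] = {
  val parquetFields = parquetRecord.getFields.asScala
  parquetFields.find(_.getName == catalystName).orElse {
    val byLowerCase = parquetFields.groupBy(_.getName.toLowerCase(Locale.ROOT))
    byLowerCase.get(catalystName.toLowerCase(Locale.ROOT)) match {
      case Some(Seq(unique)) => Some(unique) // unambiguous case-insensitive match
      case _                 => None         // no match, or fields differing only by case
    }
  }
}
{code}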

The problem arises when dealing with Parquet-backed tables for which this 
schema has not been embedded as a table property and for which the underlying 
files contain case-sensitive field names. This will happen for any Hive table 
that was not created by Spark or was created by a Spark version prior to 
2.1.0. We've seen Spark SQL return no results for any query referencing a 
case-sensitive field name against such tables.
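
A minimal repro along these lines (paths and column names are made up for illustration) exhibits the symptom:

{code:scala}
// Write Parquet data with a mixed-case column name (case is preserved on disk).
spark.range(10).selectExpr("id AS eventId").write.parquet("/tmp/events")

// Register it behind a Hive table whose metastore schema is all lower case and
// which carries no embedded case-sensitive schema property (as would be the
// case for a table created by Hive or by Spark < 2.1.0).
spark.sql(
  """CREATE EXTERNAL TABLE events (eventid BIGINT)
    |STORED AS PARQUET
    |LOCATION '/tmp/events'""".stripMargin)

// On 2.1.0 this returns no results, because the metastore name "eventid"
// never resolves against the Parquet field "eventId".
spark.sql("SELECT eventid FROM events WHERE eventid > 0").show()
{code}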

The change we're proposing is to introduce a configuration parameter that 
re-enables case-insensitive field name resolution in ParquetReadSupport. This 
option will also disable filter push-down for Parquet, since the filter 
predicate constructed by Spark SQL contains the case-insensitive field names, 
and Parquet returns zero records when such a predicate is evaluated against a 
case-sensitive column name. I was hoping to find a way to construct the filter 
on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the 
Configuration object passed to this class to the underlying 
InternalParquetRecordReader class.
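
To make the proposal concrete, usage might look like the following; the option name is a placeholder, not a finalized configuration key:

{code:scala}
// Placeholder option name for the proposed behavior.
spark.conf.set("spark.sql.parquet.caseInsensitiveResolution", "true")

// With the option enabled, field resolution in ParquetReadSupport would fall
// back to case-insensitive matching, and Parquet filter push-down would be
// disabled so that predicates built against lower-cased metastore names
// (e.g. eventid > 0) are evaluated by Spark rather than pushed to Parquet,
// where they would match zero records against the case-sensitive column.
spark.sql("SELECT eventid FROM events WHERE eventid > 0").show()
{code}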


