[GitHub] [spark] cloud-fan opened a new pull request #24041: [SPARK-27119][SQL] Do not infer schema when reading Hive serde table with native data source

GitBox Sat, 09 Mar 2019 05:52:39 -0800

cloud-fan opened a new pull request #24041: [SPARK-27119][SQL] Do not infer 
schema when reading Hive serde table with native data source
URL: https://github.com/apache/spark/pull/24041
 
 
   ## What changes were proposed in this pull request?
   
   In Spark 2.1, we hit a correctness bug. When reading a Hive serde parquet 
table with the native parquet data source, if the actual file schema differs 
from the table schema in the Hive metastore only in upper/lower case, the 
query returns 0 results.
   
   The reason is that the parquet reader is case sensitive. If we push down 
filters with column names that don't match the physical file schema 
case-sensitively, no data will be returned.
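
A toy model of this failure mode, in plain Python rather than Spark or Parquet code. The function name `scan_with_pushed_filter` and the sample rows are hypothetical; the only thing the sketch assumes is that the reader looks up the filter column by exact name and treats a missing column as null.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def scan_with_pushed_filter(rows: List[Row], column: str, value: Any) -> List[Row]:
    """Keep rows where `column` equals `value`, using an exact-case lookup.

    A column name that is absent from the file reads as None, so the
    predicate never matches and every row is dropped.
    """
    return [row for row in rows if row.get(column) == value]


# Physical file schema uses mixed case; the metastore schema is lowercase.
file_rows = [{"ID": 1, "Name": "a"}, {"ID": 2, "Name": "b"}]

print(len(scan_with_pushed_filter(file_rows, "id", 1)))  # prints 0
print(len(scan_with_pushed_filter(file_rows, "ID", 1)))  # prints 1
```

The first call uses the lowercase metastore name and silently returns nothing, which is the "0 results" symptom described above.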
   
   To fix this bug, there were 2 solutions proposed at that time:
   1. Add a config to optionally disable parquet filter pushdown, and make 
parquet column pruning case insensitive.
   https://github.com/apache/spark/pull/16797
   
   2. Infer the actual schema from data files when reading a Hive serde table 
with the native data source. A config is provided to disable it.
   https://github.com/apache/spark/pull/17229
   
   Solution 2 was accepted and merged into Spark 2.1.1.
   
   In Spark 2.4, we refactored the parquet data source a little:
   1. Do parquet filter pushdown with the actual file schema.
   https://github.com/apache/spark/pull/21696
   
   2. Make parquet filter pushdown case insensitive.
   https://github.com/apache/spark/pull/22197
   
   3. Make parquet column pruning case insensitive.
   https://github.com/apache/spark/pull/22148
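
A rough sketch of the idea behind these three changes, again as a plain Python toy model and not the actual Spark implementation. The helper names are hypothetical, and treating an ambiguous match as "do not push the filter" is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def resolve_physical_name(requested: str, file_columns: List[str]) -> Optional[str]:
    """Map a requested column name to the file's column name, ignoring case.

    Returns None when there is no match or more than one match.
    """
    matches = [c for c in file_columns if c.lower() == requested.lower()]
    return matches[0] if len(matches) == 1 else None


def scan(rows: List[Row], file_columns: List[str], column: str, value: Any) -> List[Row]:
    """Filter using the physical name from the file schema when it resolves."""
    physical = resolve_physical_name(column, file_columns)
    if physical is None:
        # Assumption: skip the pushdown and leave filtering to the caller.
        return list(rows)
    return [row for row in rows if row.get(physical) == value]


file_columns = ["ID", "Name"]
file_rows = [{"ID": 1, "Name": "a"}, {"ID": 2, "Name": "b"}]

print(resolve_physical_name("id", file_columns))   # prints ID
print(len(scan(file_rows, file_columns, "id", 1)))  # prints 1
```

Because the filter is rewritten against the file's own column names, the lowercase metastore name no longer causes rows to be dropped.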
   
   With these patches, the correctness bug in Spark 2.1 no longer exists, and 
the schema inference becomes unnecessary.
   
   To be safe, this PR only changes the default value of 
`spark.sql.hive.caseSensitiveInferenceMode` to NEVER_INFER, so users can still 
set it back to INFER_AND_SAVE if they hit problems. If we don't receive any bug 
reports about it, we can remove the related code in the next release.
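
For users who want the old behavior back, a small sketch of building the corresponding `spark-submit` argument. The helper `inference_mode_conf` is hypothetical. The config key `spark.sql.hive.caseSensitiveInferenceMode` and the third mode `INFER_ONLY` are not spelled out in this description, so please check them against the SQLConf of your Spark version.

```python
from typing import List

# Assumed config key and accepted values; verify against your Spark version.
CONF_KEY = "spark.sql.hive.caseSensitiveInferenceMode"
VALID_MODES = ("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER")


def inference_mode_conf(mode: str) -> List[str]:
    """Return the `--conf key=value` argument pair for the given mode."""
    normalized = mode.upper()
    if normalized not in VALID_MODES:
        raise ValueError(f"unknown inference mode: {mode!r}")
    return ["--conf", f"{CONF_KEY}={normalized}"]


print(" ".join(inference_mode_conf("INFER_AND_SAVE")))
# prints --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE
```

The same key and value can also be set per session instead of at submit time.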
   
   ## How was this patch tested?
   
   Existing tests.

