GitHub user seancxmao opened a pull request:

    https://github.com/apache/spark/pull/22142

    [SPARK-25132][SQL] case-insensitive field resolution when reading from 
Parquet/ORC

    ## What changes were proposed in this pull request?
    Spark SQL returns NULL for a column whose name is in a different letter case 
in the Hive metastore schema than in the Parquet schema, regardless of whether 
spark.sql.caseSensitive is set to true or false. This applies to ORC as well as 
Parquet. A brief summary follows:
    * ParquetFileFormat doesn't support case-insensitive field resolution.
    * native OrcFileFormat supports case-insensitive field resolution, but it 
cannot handle duplicate fields.
    * hive OrcFileFormat doesn't support case-insensitive field resolution.
    
    https://github.com/apache/spark/pull/15799 reverted case-insensitive 
resolution for ParquetFileFormat and hive OrcFileFormat. This PR restores it 
with two changes: fields are resolved case-insensitively only when Spark is in 
case-insensitive mode (spark.sql.caseSensitive set to false), and resolution 
fails if it is ambiguous, i.e. more than one field matches. The change covers 
ParquetFileFormat, native OrcFileFormat and hive OrcFileFormat.
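
The resolution rule described above can be illustrated with a minimal Python sketch. This is not the code from this PR: the function name `resolve_field`, its parameters, and the use of `ValueError` are hypothetical, and returning `None` for a missing field only stands in for the column being read as NULL.

```python
from typing import Optional, Sequence


def resolve_field(requested: str,
                  file_fields: Sequence[str],
                  case_sensitive: bool) -> Optional[str]:
    """Return the physical field name that matches `requested`, or None.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if case_sensitive:
        # Case-sensitive mode: only an exact name match counts.
        return requested if requested in file_fields else None

    # Case-insensitive mode: compare lower-cased names.
    wanted = requested.lower()
    matches = [name for name in file_fields if name.lower() == wanted]

    if len(matches) > 1:
        # Ambiguity: more than one field matches, so resolution fails.
        raise ValueError(
            f"Found duplicate field(s) {requested!r}: {matches} "
            "in case-insensitive mode"
        )
    return matches[0] if matches else None


# Metastore column "a" against a file that stores the column as "A".
print(resolve_field("a", ["A", "b"], case_sensitive=False))  # A
print(resolve_field("a", ["A", "b"], case_sensitive=True))   # None
```

With file fields `["a", "A"]`, the same call in case-insensitive mode raises an error instead of silently picking one of the two columns, while case-sensitive mode still resolves `"a"` exactly.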
    
    ## How was this patch tested?
    Unit tests added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/seancxmao/spark SPARK-25132

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22142
    
----
commit 5c3d20b654609c86de9c24c9751ec34916f3aabd
Author: seancxmao <seancxmao@...>
Date:   2018-08-17T10:06:28Z

    SPARK-25132: case-insensitive field resolution when reading from Parquet/ORC
    
    * Fix ParquetFileFormat
    * More than one Parquet column is matched
    * Fix OrcFileFormat (both native and hive implementations)
    * Fix issues according to review results: refactor test cases, code style, 
...
    * Test cases: change parquet/orc file schema from a to A
    * Test cases: let different columns have different value series
    * Refine error message
    * Split multi-format test suite
    * Simplify test cases for ambiguous resolution
    * Simplify test cases to reduce code lines
    * Refine tests and comments

----

