bersprockets commented on issue #23165: [SPARK-26188][SQL] FileIndex: don't 
infer data types of partition columns if user specifies schema
URL: https://github.com/apache/spark/pull/23165#issuecomment-467192011
 
 
   Hi @gengliangwang @cloud-fan 
   
   I noticed this PR changed how mixed-cased partition columns are handled when 
the user provides a schema.
   
   Say I have this file structure (note that each instance of ```pS``` is mixed 
case):
   <pre>
   bash-3.2$ find partitioned5 -type d
   partitioned5
   partitioned5/pi=2
   partitioned5/pi=2/pS=foo
   partitioned5/pi=2/pS=bar
   partitioned5/pi=1
   partitioned5/pi=1/pS=foo
   partitioned5/pi=1/pS=bar
   bash-3.2$
   </pre>
   If I load the file with a user-provided schema in 2.4 (before this PR was 
committed) or 2.3, I see:
   <pre>
   
   scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
   df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
   scala> df.printSchema
   root
    |-- intField: integer (nullable = true)
    |-- pi: integer (nullable = true)
    |-- ps: string (nullable = true)
   scala>
   </pre>
   However, with this PR I see:
   <pre>
   scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
   df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
   scala> df.printSchema
   root
    |-- intField: integer (nullable = true)
    |-- pi: integer (nullable = true)
    |-- pS: string (nullable = true)
   scala>
   </pre>
   Spark is picking up the mixed-case column name ```pS``` from the directory 
name, not the lower-case ```ps``` from my specified schema.
   
   In all cases, ```spark.sql.caseSensitive``` is set to the default (false).
   
   Not sure if this is an issue, but it is a difference.
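
To make the difference concrete, here is a minimal sketch in plain Scala (no Spark dependency) of the name resolution that the 2.3/2.4 output above is consistent with: when case sensitivity is off, a partition column found in a directory name takes the spelling of the matching field in the user-specified schema. `PartitionNameResolution` and `resolveName` are hypothetical names made up for illustration; this is a model of the observed behavior, not Spark's actual code.

```scala
// Hypothetical model of the observed behavior; not Spark's implementation.
object PartitionNameResolution {

  // Returns the column name to expose for a partition column whose name
  // was read from a directory (e.g. "pS" from "pS=foo").
  // If the user-specified schema has a matching field, its spelling wins;
  // otherwise the directory spelling is kept.
  def resolveName(
      dirName: String,
      userSchemaNames: Seq[String],
      caseSensitive: Boolean): String = {
    def matches(name: String): Boolean =
      if (caseSensitive) name == dirName else name.equalsIgnoreCase(dirName)
    userSchemaNames.find(matches).getOrElse(dirName)
  }

  def main(args: Array[String]): Unit = {
    val userSchema = Seq("intField", "pi", "ps") // from .schema(...)
    val fromPaths = Seq("pi", "pS")              // from the directory names

    // Case-insensitive matching (the default): user spelling is used.
    val resolved = fromPaths.map(resolveName(_, userSchema, caseSensitive = false))
    println(resolved.mkString(", ")) // prints "pi, ps"
  }
}
```

With this PR, the printed schema instead matches what the sketch would return if it kept `dirName` unchanged, i.e. `pS`.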

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
