Jason Pohl created SPARK-10562:
----------------------------------

             Summary: .partitionBy() creates the metastore partition columns in 
all lowercase, but persists the data path as MixedCase, resulting in an error 
when the data is later queried.
                 Key: SPARK-10562
                 URL: https://issues.apache.org/jira/browse/SPARK-10562
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: Jason Pohl


When using DataFrame.write.partitionBy().saveAsTable(), the partition-by 
columns are created in all lowercase in the metastore. However, the data is 
written to the filesystem using the original mixed-case column names.

This causes an error when running a select against the table.
--------------
from pyspark.sql import Row

# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
                       Row(Name="Frank Lampard", Goals=15, Year=2012)])

myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by the Year (which is a
# mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing an all-lowercase partition column name, "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))

%sql
--Now try to run a query against this table
select * from chelsea_goals

Error in SQL statement: UncheckedExecutionException: 
java.lang.RuntimeException: Partition column year not found in schema 
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), 
StructField(Year,LongType,true))

# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
                         Row(Name="Frank Lampard", Goals=15, year=2012)])

myDF2 = sqlContext.createDataFrame(myRDD2)

myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")

%sql select * from chelsea_goals2;
--Now everything works
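
Since the all-lowercase case above works, one possible workaround is to lowercase every column name before writing. The following is only a minimal sketch, not a fix for the underlying bug: the helper names lowercase_column_names and with_lowercase_columns are hypothetical, and it assumes the DataFrame exposes .columns and .toDF(*names) as PySpark DataFrames do.

```python
def lowercase_column_names(names):
    """Return the given column names lowercased.

    Raises ValueError if two names differ only by case, since they
    would collide after lowercasing.
    """
    lowered = [name.lower() for name in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError("column names collide after lowercasing: %r" % (names,))
    return lowered


def with_lowercase_columns(df):
    """Return a copy of df with all column names lowercased.

    Assumes df has a `columns` attribute and a `toDF(*names)` method
    (hypothetical helper, not part of Spark itself).
    """
    return df.toDF(*lowercase_column_names(df.columns))


# Intended usage in the notebook above (requires a live sqlContext):
#   lowerDF = with_lowercase_columns(myDF)
#   lowerDF.write.partitionBy("year").saveAsTable("chelsea_goals")
```

With this, the partition column name would match between the metastore and the data path, at the cost of losing the original mixed-case names.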





