[ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian updated SPARK-10562:
-------------------------------
    Description:
When using DataFrame.write.partitionBy().saveAsTable(), the partition-by columns are created in all lowercase in the metastore. However, the data is written to the filesystem using mixed case. This causes an error when running a select against the table.

{noformat}
from pyspark.sql import Row

# Create a data frame with mixed-case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
                        Row(Name="Frank Lampard", Goals=15, Year=2012)])
myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by Year (a mixed-case name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
{noformat}

{code:sql}
%sql
-- Now try to run a query against this table
select * from chelsea_goals
{code}

{noformat}
Error in SQL statement: UncheckedExecutionException: java.lang.RuntimeException: Partition column year not found in schema StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), StructField(Year,LongType,true))
{noformat}

{noformat}
# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
                         Row(Name="Frank Lampard", Goals=15, year=2012)])
myDF2 = sqlContext.createDataFrame(myRDD2)
myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
{noformat}

{code:sql}
%sql select * from chelsea_goals2;
--Now everything works
{code}

  was:
When using DataFrame.write.partitionBy().saveAsTable(), the partition-by columns are created in all lowercase in the metastore. However, the data is written to the filesystem using mixed case. This causes an error when running a select against the table.
--------------
from pyspark.sql import Row

# Create a data frame with mixed-case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
                        Row(Name="Frank Lampard", Goals=15, Year=2012)])
myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by Year (a mixed-case name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))

%sql
--Now try to run a query against this table
select * from chelsea_goals

Error in SQL statement: UncheckedExecutionException: java.lang.RuntimeException: Partition column year not found in schema StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), StructField(Year,LongType,true))

# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
                         Row(Name="Frank Lampard", Goals=15, year=2012)])
myDF2 = sqlContext.createDataFrame(myRDD2)
myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")

%sql select * from chelsea_goals2;
--Now everything works


> .partitionBy() creates the metastore partition columns in all lowercase, but
> persists the data path as MixedCase resulting in an error when the data is
> later attempted to query.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-10562
>                 URL: https://issues.apache.org/jira/browse/SPARK-10562
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Jason Pohl
>            Assignee: Wenchen Fan
>         Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable(), the partition-by
> columns are created in all lowercase in the metastore. However, the data is
> written to the filesystem using mixed case.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed-case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>                         Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by Year (a mixed-case name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> {noformat}
> {code:sql}
> %sql
> -- Now try to run a query against this table
> select * from chelsea_goals
> {code}
> {noformat}
> Error in SQL statement: UncheckedExecutionException:
> java.lang.RuntimeException: Partition column year not found in schema
> StructType(StructField(Goals,LongType,true),
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> {noformat}
> {noformat}
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>                          Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> {noformat}
> {code:sql}
> %sql select * from chelsea_goals2;
> --Now everything works
> {code}
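
The mismatch described in the report can be illustrated without a Spark cluster. The following is only a minimal pure-Python sketch, not Spark's actual code path: it assumes the failure comes from an exact, case-sensitive lookup of the lowercased metastore partition name ("year") in the data schema (Goals, Name, Year). The helper names find_partition_column and lowercase_columns are hypothetical and exist only for this illustration.

```python
def find_partition_column(schema_fields, partition_col):
    """Case-sensitive lookup, mimicking (not reproducing) the failing check."""
    if partition_col not in schema_fields:
        raise RuntimeError(
            "Partition column %s not found in schema %s"
            % (partition_col, schema_fields))
    return schema_fields.index(partition_col)


def lowercase_columns(names):
    """Return lowercased column names, refusing names that would collide."""
    lowered = [name.lower() for name in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError(
            "lowercasing would create duplicate columns: %r" % (names,))
    return lowered


# Schema as written by the DataFrame in the report (mixed case).
schema = ["Goals", "Name", "Year"]
# Assumption: the metastore keeps the partition column name in lowercase.
metastore_partition_col = "Year".lower()

try:
    find_partition_column(schema, metastore_partition_col)
except RuntimeError as err:
    print(err)  # the lookup fails because "year" != "Year"

# Workaround sketch: normalize names before writing, so both sides agree.
safe_schema = lowercase_columns(schema)
print(find_partition_column(safe_schema, metastore_partition_col))
```

With a real DataFrame, the equivalent workaround would be to rename the column before writing, for example with withColumnRenamed("Year", "year") followed by partitionBy("year"), which matches the lowercase case that the report shows working. This has not been verified against Spark 1.4.1 here.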