[jira] [Updated] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted to query.

Cheng Lian (JIRA) Sun, 25 Oct 2015 05:57:51 -0700

     [ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Cheng Lian updated SPARK-10562:
-------------------------------
    Description: 
When using DataFrame.write.partitionBy().saveAsTable(), the partition-by 
columns are created in all lowercase in the metastore.  However, the data is 
written to the filesystem using the original mixed-case column names.

This causes an error when running a select against the table.
{noformat}
from pyspark.sql import Row

# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
                       Row(Name="Frank Lampard", Goals=15, Year=2012)])

myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by the Year column
# (which is a mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
{noformat}

{code:sql}
%sql -- Now try to run a query against this table
select * from chelsea_goals
{code}

{noformat}
Error in SQL statement: UncheckedExecutionException: 
java.lang.RuntimeException: Partition column year not found in schema 
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), 
StructField(Year,LongType,true))
{noformat}
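
The error above suggests that the lowercase partition column name from the metastore is compared against the stored schema with an exact, case-sensitive match. The following is a minimal sketch of that mismatch in plain Python. It is not Spark's actual lookup code; the helper find_field and its case_sensitive parameter are hypothetical names used only for illustration.

```python
# Illustrative only: this is NOT Spark's implementation, just a model of the mismatch.
schema_fields = ["Goals", "Name", "Year"]     # field names taken from the error message
metastore_partition_column = "Year".lower()   # the metastore reports "year"


def find_field(name, fields, case_sensitive=True):
    """Return the matching field name, or None if no field matches (hypothetical helper)."""
    for field in fields:
        if field == name:
            return field
        if not case_sensitive and field.lower() == name.lower():
            return field
    return None


# An exact match cannot find "year" among "Goals", "Name", "Year".
exact_match = find_field(metastore_partition_column, schema_fields)

# A case-insensitive comparison would resolve "year" to the stored field "Year".
relaxed_match = find_field(metastore_partition_column, schema_fields,
                           case_sensitive=False)
```

Under this model, exact_match is None, which corresponds to the "Partition column year not found in schema" failure, while relaxed_match resolves to the mixed-case field.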

{noformat}
# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
                         Row(Name="Frank Lampard", Goals=15, year=2012)])

myDF2 = sqlContext.createDataFrame(myRDD2)

myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
{noformat}

{code:sql}
%sql select * from chelsea_goals2;
--Now everything works
{code}
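
Since the all-lowercase variant works, one possible workaround until the bug is fixed might be to lowercase every column name before writing. The sketch below is an assumption based only on the behaviour reported here, not a fix confirmed in this issue. The helper lowercase_columns is a hypothetical name, and the Spark call is left as a comment because it needs a live SQLContext.

```python
def lowercase_columns(names):
    """Return the column names in lowercase, refusing ambiguous renames (hypothetical helper)."""
    lowered = [name.lower() for name in names]
    # Two names that differ only by case would collide after the rename.
    if len(set(lowered)) != len(lowered):
        raise ValueError("column names collide after lowercasing: %r" % (names,))
    return lowered


# Column names of the mixed-case DataFrame from the report (what myDF.columns would hold).
new_names = lowercase_columns(["Goals", "Name", "Year"])

# With a running SQLContext, the renamed frame could then be written like this:
# myDF.toDF(*new_names).write.partitionBy("year").saveAsTable("chelsea_goals")
```

The collision check is a design choice: silently merging columns such as "Year" and "year" would lose data, so the sketch raises instead.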

  was:
When using DataFrame.write.partitionBy().saveAsTable() it creates the partition 
by columns in all lowercase in the meta-store.  However, it writes the data to 
the filesystem using mixed-case.

This causes an error when running a select against the table.
--------------
from pyspark.sql import Row

# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
                       Row(Name="Frank Lampard", Goals=15, Year=2012)])

myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by the Year column
# (which is a mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))

%sql
--Now try to run a query against this table
select * from chelsea_goals

Error in SQL statement: UncheckedExecutionException: 
java.lang.RuntimeException: Partition column year not found in schema 
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), 
StructField(Year,LongType,true))

# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
                         Row(Name="Frank Lampard", Goals=15, year=2012)])

myDF2 = sqlContext.createDataFrame(myRDD2)

myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")

%sql select * from chelsea_goals2;
--Now everything works





> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10562
>                 URL: https://issues.apache.org/jira/browse/SPARK-10562
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Jason Pohl
>            Assignee: Wenchen Fan
>         Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable(), the partition-by 
> columns are created in all lowercase in the metastore.  However, the data is 
> written to the filesystem using the original mixed-case column names.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>                        Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year column
> # (which is a mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> {noformat}
> {code:sql}
> %sql -- Now try to run a query against this table
> select * from chelsea_goals
> {code}
> {noformat}
> Error in SQL statement: UncheckedExecutionException: 
> java.lang.RuntimeException: Partition column year not found in schema 
> StructType(StructField(Goals,LongType,true), 
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> {noformat}
> {noformat}
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>                          Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> {noformat}
> {code:sql}
> %sql select * from chelsea_goals2;
> --Now everything works
> {code}



