Michel Lemay created SPARK-21021:
------------------------------------
Summary: Reading partitioned parquet does not respect specified schema column order
Key: SPARK-21021
URL: https://issues.apache.org/jira/browse/SPARK-21021
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Michel Lemay
Priority: Minor
When reading back a partitioned Parquet folder with a user-specified schema, the columns do not come back in the schema's order: the partition columns end up last.
Consider the following example:
{code:scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._  // already in scope in spark-shell

case class Event(f1: String, f2: String, f3: String)

// Write one row, partitioned by f1 and f2
val df = Seq(Event("v1", "v2", "v3")).toDF
df.write.partitionBy("f1", "f2").parquet("out")

// Read back with an explicit schema in the order f1, f2, f3
val schema: StructType = StructType(
  StructField("f1", StringType, true) ::
  StructField("f2", StringType, true) ::
  StructField("f3", StringType, true) :: Nil)
val dfRead = spark.read.schema(schema).parquet("out")

dfRead.show
// +---+---+---+
// | f3| f1| f2|
// +---+---+---+
// | v3| v1| v2|
// +---+---+---+

dfRead.columns
// Array[String] = Array(f3, f1, f2)

schema.fields
// Array(StructField(f1,StringType,true), StructField(f2,StringType,true),
//       StructField(f3,StringType,true))
{code}
This makes it hard to keep schemas compatible when reading from multiple
sources.
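A possible workaround, sketched here as an assumption rather than a confirmed fix, is to re-select the columns in the order of the requested schema after the read. The helper name withSchemaOrder and the local SparkSession setup are hypothetical and only serve to make the snippet self-contained. Column names containing dots would need backtick quoting, which this sketch does not handle.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper (not part of Spark): project the columns
// in the order in which they appear in the given schema.
def withSchemaOrder(df: DataFrame, schema: StructType): DataFrame =
  df.select(schema.fieldNames.map(col): _*)

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SPARK-21021-workaround")
  .getOrCreate()
import spark.implicits._

// Same data layout as in the example above.
Seq(("v1", "v2", "v3")).toDF("f1", "f2", "f3")
  .write.mode("overwrite").partitionBy("f1", "f2").parquet("out")

val schema = StructType(
  StructField("f1", StringType, true) ::
  StructField("f2", StringType, true) ::
  StructField("f3", StringType, true) :: Nil)

// Read with the schema, then force the column order to match it.
val ordered = withSchemaOrder(spark.read.schema(schema).parquet("out"), schema)
{code}

After the explicit select, ordered.columns follows schema.fieldNames regardless of the order the reader produced, so DataFrames from several sources can be lined up before a union.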
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)