[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

GitBox Tue, 09 Aug 2022 00:08:54 -0700


sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976365



##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Partition columns are always appended to the schema. In the case of overlapping columns, we now remove all of the partition columns from the data schema and append them afterwards. This does not change the query result, but it does change the order of the output columns.

   Essentially:
   data schema: `i, p, j`, partition schema: `p`. We remove `p` from the data schema and append it as a partition column: `i, j, p`.

   Previously we would keep the partition column as part of the data schema and insert partition values into it, which IMHO is a bit confusing. This change also makes the behaviour consistent with DSv2, which already works this way.
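
The reordering described above can be sketched in plain Scala. This is only an illustration of the column-ordering rule, not Spark's actual implementation: the object `OverlapSchemaSketch` and both helper names are hypothetical, and the sketch assumes exact, case-sensitive matching of column names, whereas Spark's resolution depends on its case-sensitivity setting.

```scala
// Hypothetical sketch of the column-ordering rule; not Spark code.
object OverlapSchemaSketch {

  // New behaviour: drop every partition column from the data columns,
  // then append all partition columns at the end.
  def outputColumns(dataCols: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val partitionSet = partitionCols.toSet
    dataCols.filterNot(partitionSet.contains) ++ partitionCols
  }

  // Previous behaviour as described in the comment: overlapping columns keep
  // their position in the data schema; only the partition columns that are
  // not already present are appended.
  def legacyOutputColumns(dataCols: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val dataSet = dataCols.toSet
    dataCols ++ partitionCols.filterNot(dataSet.contains)
  }
}
```

For the test in this diff, `OverlapSchemaSketch.outputColumns(Seq("i", "p", "j"), Seq("p"))` gives `i, j, p`, while `legacyOutputColumns` with the same arguments keeps `i, p, j`.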
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]