[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

GitBox Thu, 16 Sep 2021 20:09:26 -0700


viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710696143




##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    assert(df.schema == schema)
+    assert(df.queryExecution.optimizedPlan.schema == schema)
+    assert(df.queryExecution.executedPlan.schema == schema)
+
+    checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+    checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+  }
+
+  test("SPARK-36673: Union of structs with different orders") {
+    val df1 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner1"), struct(expr("id * 10 AS inner2"))))
+    val df2 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner2"), struct(expr("id * 10 AS inner1"))))

Review comment:
       This test mainly reflects/verifies the current behavior (i.e. the added
test already passes before this change). That is a good point about how we
should treat unions of nested columns. As mentioned in a previous comment,
`CheckAnalysis` is the guard that prevents an invalid UNION by checking the
compatibility of the data types. For now, it compares the fields of structs one
by one (by position). If the fields at some position cannot be resolved,
`CheckAnalysis` treats them as mismatched data types and stops the query.
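   
   To make the position-by-position comparison concrete, here is a minimal standalone sketch. It does not use Spark: `DType`, `Field`, `StructT` and `compatibleByPosition` are hypothetical names invented for illustration, and ignoring field names entirely is an assumption of this model, not a statement about how `CheckAnalysis` is implemented.

```scala
// Simplified stand-in types for illustration only; NOT Spark's DataType classes.
sealed trait DType
case object LongT extends DType
case object StringT extends DType
final case class Field(name: String, dtype: DType)
final case class StructT(fields: Seq[Field]) extends DType

object PositionalCheck {
  // Atomic types must be equal. Structs must have the same number of fields,
  // and the fields at each position must be compatible in turn.
  // Field names are never consulted (an assumption of this sketch).
  def compatibleByPosition(a: DType, b: DType): Boolean = (a, b) match {
    case (StructT(fa), StructT(fb)) =>
      fa.length == fb.length &&
        fa.zip(fb).forall { case (x, y) => compatibleByPosition(x.dtype, y.dtype) }
    case _ => a == b
  }
}

object PositionalCheckDemo {
  // Same shape, names swapped between the two sides: accepted by position.
  val left: DType  = StructT(Seq(Field("inner1", LongT), Field("inner2", LongT)))
  val right: DType = StructT(Seq(Field("inner2", LongT), Field("inner1", LongT)))

  // Different type at position 0: rejected.
  val other: DType = StructT(Seq(Field("inner1", StringT), Field("inner2", LongT)))

  def main(args: Array[String]): Unit = {
    println(PositionalCheck.compatibleByPosition(left, right)) // true
    println(PositionalCheck.compatibleByPosition(left, other)) // false
  }
}
```

   In this model, two structs with the same field types but swapped field names pass the check, which is the kind of case the test with different field orders exercises.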
   
   This PR mainly fixes the merging behavior for UNION, which is currently
incorrect.
   
   We can work on the mentioned issue separately.
   
   
   






