SaurabhChawla100 commented on a change in pull request #32972:
URL: https://github.com/apache/spark/pull/32972#discussion_r656491450
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala
##########
@@ -182,6 +182,15 @@ object ResolveUnion extends Rule[LogicalPlan] {
// union have consistent schema.
aliased += foundAttr
Alias(addFields(foundAttr, target), foundAttr.name)()
+ case (source: StructType, target: StructType)
+ if !allowMissingCol && !source.sameType(target) &&
+ target.toAttributes.map(attr => attr.name).sorted
+ == source.toAttributes.map(x => x.name).sorted =>
Review comment:
`unionByName` with `allowMissingColumns = true` adds the case-sensitive attribute as a missing column in both scenarios, i.e. with `spark.sql.caseSensitive` set to `false` and to `true`:
```
case class UnionClass2(a: Int, c: String)
case class UnionClass4(A: Int, b: Long)
case class UnionClass1a(a: Int, b: Long, nested: UnionClass2)
case class UnionClass1c(a: Int, b: Long, nested: UnionClass4)
val df1 = Seq((0, UnionClass1a(0, 1L, UnionClass2(1, "2")))).toDF("id", "a")
val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L)))).toDF("id", "a")
// case 1: spark.sql.caseSensitive=false
scala> spark.sql("set spark.sql.caseSensitive=false")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int,
a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res7: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`:
STRUCT<`a`: INT, `b`: BIGINT, `c`: STRING, `A`: INT>>
// case 2: spark.sql.caseSensitive=true
scala> spark.sql("set spark.sql.caseSensitive=true")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala>
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int,
a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res3: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`:
STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>
```
For `unionByName` without `allowMissingColumns` we cannot add the missing column; it should throw an exception saying the schemas are not the same, so the union cannot be performed. That is the reason for comparing the attribute names as-is instead of lowercasing them before the comparison:
```
scala> var unionDf = df1.unionByName(df2)
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
struct<a:int,b:bigint,nested:struct<A:int,b:bigint>> <>
struct<a:int,b:bigint,nested:struct<a:int,c:string>> at the second column of
the second table;
```
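To make the intent of the name comparison concrete, here is a minimal, hypothetical sketch (not code from the PR) of the check the new guard performs, using made-up struct types whose field names differ only in case:
```scala
import org.apache.spark.sql.types._

// Hypothetical structs; field names differ only in case ("A" vs "a").
val source = new StructType().add("A", IntegerType).add("b", LongType)
val target = new StructType().add("a", IntegerType).add("b", LongType)

// Comparing the raw field names, as the proposed guard does:
// the sorted name lists differ ("A" != "a"), so the guard would not match.
source.fieldNames.sorted.sameElements(target.fieldNames.sorted)          // false

// Lowercasing before the comparison would make the lists equal, i.e. the two
// structs would be treated as having the same columns despite the case
// difference, which is what we want to avoid in the non-allowMissing path.
source.fieldNames.map(_.toLowerCase).sorted
  .sameElements(target.fieldNames.map(_.toLowerCase).sorted)             // true
```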