SaurabhChawla100 commented on a change in pull request #32972:
URL: https://github.com/apache/spark/pull/32972#discussion_r656491450
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala
##########
@@ -182,6 +182,15 @@ object ResolveUnion extends Rule[LogicalPlan] {
// union have consistent schema.
aliased += foundAttr
Alias(addFields(foundAttr, target), foundAttr.name)()
+ case (source: StructType, target: StructType)
+ if !allowMissingCol && !source.sameType(target) &&
+ target.toAttributes.map(attr => attr.name).sorted
+ == source.toAttributes.map(x => x.name).sorted =>
Review comment:
`unionByName` with `allowMissingColumns = true` adds the case-sensitive attribute as a missing column in both scenarios, i.e. with `spark.sql.caseSensitive` set to `false` and to `true`:
```
case class UnionClass2(a: Int, c: String)
case class UnionClass4(A: Int, b: Long)
case class UnionClass1a(a: Int, b: Long, nested: UnionClass2)
case class UnionClass1c(a: Int, b: Long, nested: UnionClass4)
val df1 = Seq((0, UnionClass1a(0, 1L, UnionClass2(1, "2")))).toDF("id", "a")
val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L)))).toDF("id", "a")
// case 1: spark.sql.caseSensitive=false
scala> spark.sql("set spark.sql.caseSensitive=false")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int,
a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res7: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`:
STRUCT<`a`: INT, `b`: BIGINT, `c`: STRING, `A`: INT>>
// case 2: spark.sql.caseSensitive=true
scala> spark.sql("set spark.sql.caseSensitive=true")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala>
scala> var unionDf = df1.unionByName(df2, true)
unionDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int,
a: struct<a: int, b: bigint ... 1 more field>]
scala> unionDf.schema.toDDL
res3: String = `id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, `nested`:
STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>
```
For `unionByName` without `allowMissingColumns` we cannot add the missing column; it should throw an exception saying the schemas are not the same, so the union cannot be performed. That is the reason for comparing the attribute names as-is instead of lowercasing them before the comparison:
```
scala> var unionDf = df1.unionByName(df2)
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
struct<a:int,b:bigint,nested:struct<A:int,b:bigint>> <>
struct<a:int,b:bigint,nested:struct<a:int,c:string>> at the second column of
the second table;
```
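To make the intent of the name comparison concrete, here is a minimal, hypothetical sketch (not code from the PR) of the check the new guard performs, using made-up struct types whose field names differ only in case:
```scala
import org.apache.spark.sql.types._

// Hypothetical structs; field names differ only in case ("A" vs "a").
val source = new StructType().add("A", IntegerType).add("b", LongType)
val target = new StructType().add("a", IntegerType).add("b", LongType)

// Comparing the raw field names, as the proposed guard does:
// the sorted name lists differ ("A" != "a"), so the guard would not match.
source.fieldNames.sorted.sameElements(target.fieldNames.sorted)          // false

// Lowercasing before the comparison would make the lists equal, i.e. the two
// structs would be treated as having the same columns despite the case
// difference, which is what we want to avoid in the non-allowMissing path.
source.fieldNames.map(_.toLowerCase).sorted
  .sameElements(target.fieldNames.map(_.toLowerCase).sorted)             // true
```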