SaurabhChawla100 commented on pull request #32972:
URL: https://github.com/apache/spark/pull/32972#issuecomment-864568203
> Yeah I'm saying it only properly handles one level of a nested struct, not recursively like it should. Got your code running to show an example:
>
> ```
> >>> df1 = spark.createDataFrame([Row(a=Row(aa=Row(aaa=1)))])
> >>> df2 = spark.createDataFrame([Row(a=Row(aa=Row(aab=1)))])
> >>> df1
> DataFrame[a: struct<aa:struct<aaa:bigint>>]
> >>> df2
> DataFrame[a: struct<aa:struct<aab:bigint>>]
> >>> df1.unionByName(df2)
> DataFrame[a: struct<aa:struct<aaa:bigint,aab:bigint>>]
> >>> df1.unionByName(df2).explain()
> == Physical Plan ==
> Union
> :- *(1) Project [if (isnull(a#21)) null else named_struct(aa, if (isnull(a#21.aa)) null else named_struct(aaa, a#21.aa.aaa, aab, null)) AS a#37]
> : +- *(1) Scan ExistingRDD[a#21]
> +- *(2) Project [if (isnull(a#23)) null else named_struct(aa, if (isnull(a#23.aa)) null else named_struct(aaa, null, aab, a#23.aa.aab)) AS a#34]
> +- *(2) Scan ExistingRDD[a#23]
> ```
>
> The inner struct gets merged, adding missing columns, even though allowMissingCol is false.
Thank you for explaining the nested struct scenario. When allowMissingCol is
false, we only need to sort the struct fields, so the code now uses
sortStructFields instead of the addFields method.
I have also added a unit test for this case:
```
case class UnionClass1d(c1: Int, c2: Int, c3: Struct3)
case class UnionClass1e(c2: Int, c1: Int, c3: Struct4)
case class Struct3(c3: Int)
case class Struct4(c4: Int)

val df1 = Seq((1, 2, UnionClass1d(1, 2, Struct3(1)))).toDF("a", "b", "c")
val df2 = Seq((1, 2, UnionClass1e(1, 2, Struct4(1)))).toDF("a", "b", "c")

// With allowMissingCol = false this does not add the missing nested column;
// it throws an exception instead:
df1.unionByName(df2)
// "Union can only be performed on tables with the compatible column types." +
//   " struct<c1:int,c2:int,c3:struct<c4:int>> <> struct<c1:int,c2:int,c3:struct<c3:int>>" +
//   " at the third column of the second table"
```
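As a rough illustration of the sort-only behaviour described above, here is a minimal sketch in plain Python. It is not Spark's implementation: nested dicts stand in for struct schemas, and the function names `sort_struct_fields` and `union_by_name_strict` are hypothetical helpers chosen for this example. The idea is that fields are reordered recursively at every nesting level, but no field is ever added, so a mismatch in a nested struct surfaces as an error.

```python
# Hypothetical stand-in: a struct schema is a dict mapping a field name to
# either a type name (str) or a nested dict. This is NOT Spark's code.

def sort_struct_fields(schema):
    """Recursively sort fields by name at every level; never add fields."""
    return {
        name: sort_struct_fields(ftype) if isinstance(ftype, dict) else ftype
        for name, ftype in sorted(schema.items())
    }


def union_by_name_strict(left, right):
    """Sort both schemas recursively, then require them to be identical."""
    sorted_left = sort_struct_fields(left)
    sorted_right = sort_struct_fields(right)
    if sorted_left != sorted_right:
        raise ValueError(
            f"incompatible column types: {sorted_right} <> {sorted_left}"
        )
    return sorted_left


# Same fields in a different order: accepted after sorting.
ok = union_by_name_strict(
    {"c1": "int", "c2": "int", "c3": {"c3": "int"}},
    {"c2": "int", "c1": "int", "c3": {"c3": "int"}},
)

# Nested field names differ (c3 vs c4): rejected rather than merged.
try:
    union_by_name_strict(
        {"c1": "int", "c2": "int", "c3": {"c3": "int"}},
        {"c2": "int", "c1": "int", "c3": {"c4": "int"}},
    )
    rejected = False
except ValueError:
    rejected = True
```

The second call mirrors the unit test: the top-level fields only differ in order, but the nested struct has a different field name, so the union fails instead of silently producing a merged struct.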