[GitHub] [spark] SaurabhChawla100 commented on pull request #32972: [SPARK-35756][SQL] unionByName supports struct having same col names but different sequence

GitBox Sun, 20 Jun 2021 08:05:28 -0700


SaurabhChawla100 commented on pull request #32972:
URL: https://github.com/apache/spark/pull/32972#issuecomment-864568203



   > Yeah I'm saying it only properly handles one level of a nested struct, not recursively like it should. Got your code running to show an example:
   > 
   > ```
   > >>> df1 = spark.createDataFrame([Row(a=Row(aa=Row(aaa=1)))])
   > >>> df2 = spark.createDataFrame([Row(a=Row(aa=Row(aab=1)))])
   > >>> df1
   > DataFrame[a: struct<aa:struct<aaa:bigint>>]
   > >>> df2
   > DataFrame[a: struct<aa:struct<aab:bigint>>]
   > >>> df1.unionByName(df2)
   > DataFrame[a: struct<aa:struct<aaa:bigint,aab:bigint>>]
   > >>> df1.unionByName(df2).explain()
   > == Physical Plan ==
   > Union
   > :- *(1) Project [if (isnull(a#21)) null else named_struct(aa, if (isnull(a#21.aa)) null else named_struct(aaa, a#21.aa.aaa, aab, null)) AS a#37]
   > :  +- *(1) Scan ExistingRDD[a#21]
   > +- *(2) Project [if (isnull(a#23)) null else named_struct(aa, if (isnull(a#23.aa)) null else named_struct(aaa, null, aab, a#23.aa.aab)) AS a#34]
   >    +- *(2) Scan ExistingRDD[a#23]
   > ```
   > 
   > The inner struct gets merged adding missing columns even though allowMissingCol is false
   
   Thank you for explaining the nested struct scenario. When allowMissingCol is false, we only need to sort the struct fields, so I now use sortStructFields instead of the addFields method.
   
   I also added a unit test for this case:
   
   ```
   case class UnionClass1d(c1: Int, c2: Int, c3: Struct3)
   case class UnionClass1e(c2: Int, c1: Int, c3: Struct4)
   case class Struct3(c3: Int)
   case class Struct4(c4: Int)
   
   val df1 = Seq((1, 2, UnionClass1d(1, 2, Struct3(1)))).toDF("a", "b", "c")
   val df2 = Seq((1, 2, UnionClass1e(1, 2, Struct4(1)))).toDF("a", "b", "c")
   
   // This does not add the missing column; instead it throws an exception with the message:
   //   "Union can only be performed on tables with the compatible column types." +
   //   " struct<c1:int,c2:int,c3:struct<c4:int>> <> struct<c1:int,c2:int,c3:struct<c3:int>>" +
   //   " at the third column of the second table"
   df1.unionByName(df2)
   ```
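
As a rough illustration of the sort-only behaviour described above, here is a minimal standalone Python sketch over a toy schema model. It is not the Spark implementation: the list-of-pairs schema representation and the `sort_struct_fields` helper are assumptions made up for this example, and the helper is only loosely analogous to the `sortStructFields` mentioned in the comment. It shows that recursive sorting makes structs with the same field names in a different order compare equal, while structs with genuinely different nested fields still mismatch.

```python
# Toy schema model (an assumption for illustration only): a struct is a list of
# (field_name, field_type) pairs; a leaf type is a plain string such as "int".

def sort_struct_fields(schema):
    """Recursively sort struct fields by name without adding or dropping any."""
    if not isinstance(schema, list):
        return schema  # leaf type, nothing to sort
    return sorted(
        ((name, sort_struct_fields(ftype)) for name, ftype in schema),
        key=lambda field: field[0],
    )

# Same shape as the case classes in the unit test above.
left = [("c1", "int"), ("c2", "int"), ("c3", [("c3", "int")])]
reordered = [("c2", "int"), ("c1", "int"), ("c3", [("c3", "int")])]
different = [("c2", "int"), ("c1", "int"), ("c3", [("c4", "int")])]

# Same names in a different order: equal once sorted.
same_after_sort = sort_struct_fields(left) == sort_struct_fields(reordered)
# Different nested field name: still a mismatch, nothing is merged in.
still_mismatched = sort_struct_fields(left) != sort_struct_fields(different)
print(same_after_sort, still_mismatched)
```

Because the sketch only reorders fields and never merges them, the nested `c3`/`c4` difference survives sorting, which corresponds to the case where the union is expected to fail rather than add a missing column.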
   


