[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-35290:
-----------------------------------

    Assignee: Adam Binford

> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
>                 Key: SPARK-35290
>                 URL: https://issues.apache.org/jira/browse/SPARK-35290
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Assignee: Adam Binford
>            Priority: Major
>
> We've encountered a few weird edge cases that seem to fail the new null 
> filling unionByName (which has been a great addition!). It seems to stem from 
> the fields being sorted by name and corrupted along the way. The simple 
> reproduction is:
> {code:java}
> df = spark.createDataFrame([[]])
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <> 
> struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column 
> of the second table;
> {code}
> You can see in the second schema that it has 
> {code:java}
> b:struct<aa:string,bb:string>
> {code}
> when it should be
> {code:java}
> b:struct<ba:string:bb:string>
> {code}
> It seems to happen somewhere during 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73,]
>  as everything seems correct up to that point from my testing. It's either 
> modifying one expression during the transformUp then corrupts other 
> expressions that are then modified, or the ExtractValue before the 
> addFieldsInto is remembering the ordinal position in the struct that is then 
> changing and causing issues.
>  
> I found that simply using sortStructFields instead of 
> sortStructFieldsInWithFields gets things working correctly, but definitely 
> has a performance impact. The deep expr unionByName test takes ~1-2 seconds 
> normally but ~12-15 seconds with this change. I assume because the original 
> method tried to rewrite existing expressions vs the sortStructFields just 
> adds expressions on top of existing ones to project the new order.
> I'm not sure if it makes sense to take the slower but works in the edge cases 
> method (assuming it doesn't break other cases, all existing tests pass), or 
> if there's a way to fix the existing method for cases like this.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to