[
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-35290:
------------------------------------
Assignee: Apache Spark
> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
> Key: SPARK-35290
> URL: https://issues.apache.org/jira/browse/SPARK-35290
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Assignee: Apache Spark
> Priority: Major
>
> We've encountered a few weird edge cases that seem to fail the new null
> filling unionByName (which has been a great addition!). It seems to stem from
> the fields being sorted by name and corrupted along the way. The simple
> reproduction is:
> {code:java}
> df = spark.createDataFrame([[]])
> df1 = (df
> .withColumn('top', F.struct(
> F.struct(
> F.lit('ba').alias('ba')
> ).alias('b')
> ))
> )
> df2 = (df
> .withColumn('top', F.struct(
> F.struct(
> F.lit('aa').alias('aa')
> ).alias('a'),
> F.struct(
> F.lit('bb').alias('bb')
> ).alias('b'),
> ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables
> with the compatible column types.
> struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <>
> struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column
> of the second table;
> {code}
> You can see in the second schema that it has
> {code:java}
> b:struct<aa:string,bb:string>
> {code}
> when it should be
> {code:java}
> b:struct<ba:string:bb:string>
> {code}
> It seems to happen somewhere during
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73,]
> as everything seems correct up to that point from my testing. It's either
> modifying one expression during the transformUp then corrupts other
> expressions that are then modified, or the ExtractValue before the
> addFieldsInto is remembering the ordinal position in the struct that is then
> changing and causing issues.
>
> I found that simply using sortStructFields instead of
> sortStructFieldsInWithFields gets things working correctly, but definitely
> has a performance impact. The deep expr unionByName test takes ~1-2 seconds
> normally but ~12-15 seconds with this change. I assume because the original
> method tried to rewrite existing expressions vs the sortStructFields just
> adds expressions on top of existing ones to project the new order.
> I'm not sure if it makes sense to take the slower but works in the edge cases
> method (assuming it doesn't break other cases, all existing tests pass), or
> if there's a way to fix the existing method for cases like this.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]