[
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338921#comment-17338921
]
Adam Binford commented on SPARK-35290:
--------------------------------------
Running into some issues trying to get case insensitivity working correctly. I
was hoping to leave everything on both sides of the union in its existing
casing and just get the fields in the right order, but I'm running into this
issue:
{code:java}
>>> from pyspark.sql.functions import *
>>> df1 = spark.range(1).withColumn('top', struct(lit('A').alias('A')))
>>> df2 = spark.range(1).withColumn('top', struct(lit('a').alias('a')))
>>> spark.conf.set('spark.sql.caseSensitive', 'true')
...
pyspark.sql.utils.AnalysisException: Union can only be performed on tables with
the compatible column types. struct<a:string> <> struct<A:string> at the second
column of the second table;
>>> spark.conf.set('spark.sql.caseSensitive', 'false')
>>> df1.union(df2)
DataFrame[id: bigint, top: struct<A:string,a:string>]
>>> df1.unionByName(df2)
DataFrame[id: bigint, top: struct<A:string,a:string>]
>>> df1.unionByName(df2, True)
DataFrame[id: bigint, top: struct<A:string,a:string>]
{code}
With case sensitivity enabled, it errors out as expected because the two
structs have different types. However, with case sensitivity disabled, the
union succeeds because it sees them as the same type, but when the schemas are
merged it treats them as two separate fields. I assume this is related to the
StructType.merge method, but I don't know exactly where that gets called in the
context of a Union. I don't see anything in that merge function that handles
case insensitivity. Is that a bug in itself, or a feature?
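For reference, a case-insensitive merge would need to match fields by folded
name instead of appending both. A minimal plain-Python sketch of that matching
logic (hypothetical names, not the actual Catalyst implementation; fields are
modeled as (name, type) pairs):

```python
# Hypothetical sketch of case-insensitive struct-field merging, analogous to
# what StructType.merge would need when spark.sql.caseSensitive is false.

def merge_fields(left, right, case_sensitive=False):
    """Merge two field lists; fields differing only by case collapse into
    one (keeping the left side's name) when case-insensitive."""
    key = (lambda n: n) if case_sensitive else str.lower
    merged = list(left)
    seen = {key(name) for name, _ in left}
    for name, typ in right:
        if key(name) not in seen:
            merged.append((name, typ))
            seen.add(key(name))
    return merged

# Case-insensitive: 'A' and 'a' are treated as the same field.
print(merge_fields([("A", "string")], [("a", "string")]))
# -> [('A', 'string')]

# Case-sensitive: both fields survive, mirroring struct<A:string,a:string>.
print(merge_fields([("A", "string")], [("a", "string")], case_sensitive=True))
# -> [('A', 'string'), ('a', 'string')]
```

The second call reproduces the duplicated-field schema shown in the snippet
above, which is what the merge appears to do today regardless of the config.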
> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
> Key: SPARK-35290
> URL: https://issues.apache.org/jira/browse/SPARK-35290
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Priority: Major
>
> We've encountered a few weird edge cases that fail the new null-filling
> unionByName (which has been a great addition!). It seems to stem from the
> fields being sorted by name and getting corrupted along the way. A simple
> reproduction:
> {code:java}
> import pyspark.sql.functions as F
>
> df = spark.createDataFrame([[]])
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables
> with the compatible column types.
> struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <>
> struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column
> of the second table;
> {code}
> You can see in the second schema that it has
> {code:java}
> b:struct<aa:string,bb:string>
> {code}
> when it should be
> {code:java}
> b:struct<ba:string,bb:string>
> {code}
> It seems to happen somewhere in
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73],
> as everything looks correct up to that point in my testing. Either modifying
> one expression during the transformUp corrupts other expressions that are
> modified afterwards, or the ExtractValue created before addFieldsInto
> remembers an ordinal position in the struct that later changes, causing the
> issue.
>
> I found that simply using sortStructFields instead of
> sortStructFieldsInWithFields gets things working correctly, but it has a
> definite performance impact: the deep expr unionByName test takes ~1-2
> seconds normally but ~12-15 seconds with this change. I assume that's because
> the original method tries to rewrite existing expressions, whereas
> sortStructFields adds new expressions on top of the existing ones to project
> the new order.
> I'm not sure whether it makes sense to take the slower method that works in
> these edge cases (assuming it doesn't break other cases; all existing tests
> pass), or whether there's a way to fix the existing method for cases like
> this.
>
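The ordinal-position theory in the quoted report can be illustrated with a
plain-Python analogy (hypothetical names, not Catalyst code): an extractor
resolved by ordinal against the original field order keeps its cached position
even after the struct's fields are reordered, so it reads the wrong field.

```python
# Analogy for the suspected bug: a GetStructField-style extractor caches an
# ordinal resolved against the *original* field order; after the fields are
# sorted by name, the stale ordinal points at a different field.

struct_fields = ["b", "a"]          # original field order
values = {"b": "ba", "a": "aa"}

# Resolve an extractor for field "b" by ordinal against the original order.
cached_ordinal = struct_fields.index("b")   # 0

# Later the fields are sorted by name (conceptually what the sorting step
# does), but the extractor still holds the stale ordinal.
sorted_fields = sorted(struct_fields)       # ["a", "b"]
row = [values[f] for f in sorted_fields]    # ["aa", "ba"]

print(row[cached_ordinal])  # -> 'aa': field "a"'s value, though "b" was asked for
```

This mirrors how a ba-named value could end up labeled as an aa field in the
corrupted schema above.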
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]