[
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338921#comment-17338921
]
Adam Binford commented on SPARK-35290:
--------------------------------------
Running into some issues trying to get case insensitivity working correctly. I
was hoping to leave everything on both sides of the union in its existing
casing and just get the fields in the right order, but I'm running into this
issue:
{code:java}
>>> from pyspark.sql.functions import *
>>> df1 = spark.range(1).withColumn('top', struct(lit('A').alias('A')))
>>> df2 = spark.range(1).withColumn('top', struct(lit('a').alias('a')))
>>> spark.conf.set('spark.sql.caseSensitive', 'true')
...
pyspark.sql.utils.AnalysisException: Union can only be performed on tables with
the compatible column types. struct<a:string> <> struct<A:string> at the second
column of the second table;
>>> spark.conf.set('spark.sql.caseSensitive', 'false')
>>> df1.union(df2)
DataFrame[id: bigint, top: struct<A:string,a:string>]
>>> df1.unionByName(df2)
DataFrame[id: bigint, top: struct<A:string,a:string>]
>>> df1.unionByName(df2, True)
DataFrame[id: bigint, top: struct<A:string,a:string>]
{code}
With case sensitivity enabled, it errors out as expected because the two
structs have different types. However, with case sensitivity disabled, the
union succeeds because it sees them as the same type, but when the schemas are
merged it treats them as two separate fields. I assume this is related to the
StructType.merge method, but I don't know exactly where that gets called in the
context of a Union. I don't see anything in that merge function that handles
case insensitivity. Is that a bug in itself, or a feature?
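For reference, a case-insensitive merge would need to match fields by folded
name instead of appending both. A minimal plain-Python sketch of that matching
logic (hypothetical names, not the actual Catalyst implementation; fields are
modeled as (name, type) pairs):

```python
# Hypothetical sketch of case-insensitive struct-field merging, analogous to
# what StructType.merge would need when spark.sql.caseSensitive is false.

def merge_fields(left, right, case_sensitive=False):
    """Merge two field lists; fields differing only by case collapse into
    one (keeping the left side's name) when case-insensitive."""
    key = (lambda n: n) if case_sensitive else str.lower
    merged = list(left)
    seen = {key(name) for name, _ in left}
    for name, typ in right:
        if key(name) not in seen:
            merged.append((name, typ))
            seen.add(key(name))
    return merged

# Case-insensitive: 'A' and 'a' are treated as the same field.
print(merge_fields([("A", "string")], [("a", "string")]))
# -> [('A', 'string')]

# Case-sensitive: both fields survive, mirroring struct<A:string,a:string>.
print(merge_fields([("A", "string")], [("a", "string")], case_sensitive=True))
# -> [('A', 'string'), ('a', 'string')]
```

The second call reproduces the duplicated-field schema shown in the snippet
above, which is what the merge appears to do today regardless of the config.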
> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
> Key: SPARK-35290
> URL: https://issues.apache.org/jira/browse/SPARK-35290
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Priority: Major
>
> We've encountered a few weird edge cases that fail the new null-filling
> unionByName (which has been a great addition!). It seems to stem from the
> fields being sorted by name and getting corrupted along the way. A simple
> reproduction:
> {code:java}
> import pyspark.sql.functions as F
>
> df = spark.createDataFrame([[]])
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables
> with the compatible column types.
> struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <>
> struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column
> of the second table;
> {code}
> You can see in the second schema that it has
> {code:java}
> b:struct<aa:string,bb:string>
> {code}
> when it should be
> {code:java}
> b:struct<ba:string,bb:string>
> {code}
> It seems to happen somewhere in
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73],
> as everything looks correct up to that point in my testing. Either modifying
> one expression during the transformUp corrupts other expressions that are
> modified afterwards, or the ExtractValue created before addFieldsInto
> remembers an ordinal position in the struct that later changes, causing the
> issue.
>
> I found that simply using sortStructFields instead of
> sortStructFieldsInWithFields gets things working correctly, but it has a
> definite performance impact: the deep expr unionByName test takes ~1-2
> seconds normally but ~12-15 seconds with this change. I assume that's because
> the original method tries to rewrite existing expressions, whereas
> sortStructFields adds new expressions on top of the existing ones to project
> the new order.
> I'm not sure whether it makes sense to take the slower method that works in
> these edge cases (assuming it doesn't break other cases; all existing tests
> pass), or whether there's a way to fix the existing method for cases like
> this.
>
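The ordinal-position theory in the quoted report can be illustrated with a
plain-Python analogy (hypothetical names, not Catalyst code): an extractor
resolved by ordinal against the original field order keeps its cached position
even after the struct's fields are reordered, so it reads the wrong field.

```python
# Analogy for the suspected bug: a GetStructField-style extractor caches an
# ordinal resolved against the *original* field order; after the fields are
# sorted by name, the stale ordinal points at a different field.

struct_fields = ["b", "a"]          # original field order
values = {"b": "ba", "a": "aa"}

# Resolve an extractor for field "b" by ordinal against the original order.
cached_ordinal = struct_fields.index("b")   # 0

# Later the fields are sorted by name (conceptually what the sorting step
# does), but the extractor still holds the stale ordinal.
sorted_fields = sorted(struct_fields)       # ["a", "b"]
row = [values[f] for f in sorted_fields]    # ["aa", "ba"]

print(row[cached_ordinal])  # -> 'aa': field "a"'s value, though "b" was asked for
```

This mirrors how a ba-named value could end up labeled as an aa field in the
corrupted schema above.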
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]