Huon Wilson created SPARK-27685:
-----------------------------------

             Summary: `union` doesn't promote non-nullable columns of struct to 
nullable
                 Key: SPARK-27685
                 URL: https://issues.apache.org/jira/browse/SPARK-27685
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Huon Wilson


When doing a {{union}} of two dataframes, a column that is nullable in one of 
the dataframes will be nullable in the union, promoting the non-nullable one to 
be nullable. 

This doesn't happen properly for columns nested as subcolumns of a {{struct}}. 
It seems to just take the nullability of the first dataframe in the union, 
meaning a nullable column will become non-nullable, resulting in invalid values.

{code:scala}
case class X(x: Option[Long])
case class Nested(nested: X)

// First, just work with normal columns
val df1 = Seq(1L, 2L).toDF("x")
val df2 = Seq(Some(3L), None).toDF("x")

df1.printSchema
// root
//  |-- x: long (nullable = false)

df2.printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).as[X].collect
// res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))

// Now, the same with the 'x' column within a struct:

val struct1 = df1.select(struct('x) as "nested")
val struct2 = df2.select(struct('x) as "nested")

struct1.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

struct2.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)

// BAD: the x column is not nullable
(struct1 union struct2).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

// BAD: the last x value became "Some(0)", instead of "None"
(struct1 union struct2).as[Nested].collect
// res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), 
Nested(X(Some(3))), Nested(X(Some(0))))

// Flipping the order makes x nullable as desired
(struct2 union struct1).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)
(struct2 union struct1).as[Y].collect
// res26: Array[Y] = Array(Y(X(Some(3))), Y(X(None)), Y(X(Some(1))), 
Y(X(Some(2))))
{code}

Note the two {{BAD}} lines, where the union of structs became non-nullable and 
resulted in invalid values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to