Josh Rosen created SPARK-14584:
----------------------------------

             Summary: Improve recognition of non-nullability in Dataset transformations
                 Key: SPARK-14584
                 URL: https://issues.apache.org/jira/browse/SPARK-14584
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Josh Rosen


There are many cases where we can statically know that a field will never be 
null. For instance, a case class field with a primitive type (such as Int) can 
never hold null. However, there are currently several cases in the Dataset API 
where we do not recognize this non-nullability. For instance:

{code}
case class MyCaseClass(foo: Int)
sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
{code}

claims that the {{foo}} field is nullable even though this is impossible.

I believe that this is due to the way that we reason about nullability when 
constructing serializer expressions in ExpressionEncoders. The following 
assertion will currently fail if added to ExpressionEncoder:

{code}
require(schema.size == serializer.size)
schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
  require(field.dataType == fieldSerializer.dataType,
    s"Field ${field.name}'s data type is ${field.dataType} in the schema " +
      s"but ${fieldSerializer.dataType} in its serializer")
  require(field.nullable == fieldSerializer.nullable,
    s"Field ${field.name}'s nullability is ${field.nullable} in the schema " +
      s"but ${fieldSerializer.nullable} in its serializer")
}
{code}

Most often, the schema claims that a field is non-nullable while the encoder 
allows for nullability. Occasionally we also see a mismatch in the data types, 
caused by disagreements over the nullability of nested structs' fields (or of 
fields of structs in arrays).

I think the problem is that when we're reasoning about nullability in a 
struct's schema we consider its fields' nullability to be independent of the 
nullability of the struct itself, whereas in the serializer expressions we are 
considering those field extraction expressions to be nullable if the input 
objects themselves can be nullable.
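
As a rough illustration of that disagreement, here is a toy model in plain 
Scala. It is not Spark code: {{Field}}, {{Struct}} and {{NullabilityViews}} are 
hypothetical names, and the two "views" only approximate the reasoning 
described above.

```scala
// Toy model only: these types are hypothetical and are not Spark classes.
final case class Field(name: String, nullable: Boolean)
final case class Struct(fields: Seq[Field], nullable: Boolean)

object NullabilityViews {
  // Schema view: a field's nullability is recorded independently of
  // whether the enclosing struct can be null.
  def schemaView(struct: Struct): Map[String, Boolean] =
    struct.fields.map(f => f.name -> f.nullable).toMap

  // Serializer view: extracting a field from an object that may itself be
  // null is treated as nullable, even if the field is declared non-nullable.
  def serializerView(struct: Struct): Map[String, Boolean] =
    struct.fields.map(f => f.name -> (struct.nullable || f.nullable)).toMap

  // Names of the fields on which the two views disagree.
  def mismatches(struct: Struct): Seq[String] = {
    val fromSchema = schemaView(struct)
    val fromSerializer = serializerView(struct)
    struct.fields.map(_.name).filter(n => fromSchema(n) != fromSerializer(n))
  }
}
```

In this model, a non-nullable field inside a nullable struct is the only case 
where the two views disagree; once the struct itself is non-nullable they 
agree on every field.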

I'm not sure what the simplest fix is. One proposal would be to leave the 
serializers unchanged and have ObjectOperator derive its output attributes 
from an explicitly-passed schema rather than from the serializers' attributes. 
However, I worry that this might introduce bugs if the serializer and schema 
disagree.
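
One way to guard against that risk could look like the sketch below, written 
in plain Scala with no Spark dependency. {{Attr}} and 
{{ExplicitSchema.outputAttributes}} are hypothetical names for illustration, 
not existing Spark APIs: the explicit schema is used as the output only when 
it agrees with the serializer-derived attributes, and otherwise every 
disagreement is reported.

```scala
// Hypothetical stand-in for a schema field or a serializer output attribute;
// this is not a Spark class.
final case class Attr(name: String, dataType: String, nullable: Boolean)

object ExplicitSchema {
  // Returns the explicit schema as the output attributes if it agrees with
  // the serializer-derived attributes, otherwise the list of disagreements.
  def outputAttributes(
      schema: Seq[Attr],
      serializer: Seq[Attr]): Either[Seq[String], Seq[Attr]] = {
    if (schema.size != serializer.size) {
      Left(Seq(s"schema has ${schema.size} fields but serializer has ${serializer.size}"))
    } else {
      val problems = schema.zip(serializer).flatMap { case (f, s) =>
        val typeIssue =
          if (f.dataType != s.dataType) Some(s"${f.name}: data type ${f.dataType} vs ${s.dataType}")
          else None
        val nullIssue =
          if (f.nullable != s.nullable) Some(s"${f.name}: nullable ${f.nullable} vs ${s.nullable}")
          else None
        typeIssue.toSeq ++ nullIssue.toSeq
      }
      if (problems.isEmpty) Right(schema) else Left(problems)
    }
  }
}
```

Returning the disagreements instead of throwing makes it possible to log or 
test for every mismatch at once, at the cost of an extra check at each call 
site.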



