Josh Rosen created SPARK-14584:
----------------------------------
Summary: Improve recognition of non-nullability in Dataset transformations
Key: SPARK-14584
URL: https://issues.apache.org/jira/browse/SPARK-14584
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Josh Rosen
There are many cases where we can statically know that a field will never be
null. For instance, a field in a case class with a primitive type will never
return null. However, there are currently several cases in the Dataset API
where we do not properly recognize this non-nullability. For instance:
{code}
case class MyCaseClass(foo: Int)
sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
{code}
claims that the {{foo}} field is nullable even though this is impossible.
I believe that this is due to the way that we reason about nullability when
constructing serializer expressions in ExpressionEncoders. The following
assertion will currently fail if added to ExpressionEncoder:
{code}
require(schema.size == serializer.size)
schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
  require(field.dataType == fieldSerializer.dataType,
    s"Field ${field.name}'s data type is " +
    s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
  require(field.nullable == fieldSerializer.nullable,
    s"Field ${field.name}'s nullability is " +
    s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
}
{code}
Most often, the schema claims that a field is non-nullable while the encoder
allows for nullability, but occasionally we see a mismatch in the datatypes due
to disagreements over the nullability of nested structs' fields (or fields of
structs in arrays).
I think the problem is that when we're reasoning about nullability in a
struct's schema we consider its fields' nullability to be independent of the
nullability of the struct itself, whereas in the serializer expressions we are
considering those field extraction expressions to be nullable if the input
objects themselves can be nullable.
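
The suspected disagreement can be illustrated with a toy model in plain Scala. This is only a sketch of the reasoning above, not Spark's actual implementation: `Field`, `NullabilityModel`, `schemaView`, `serializerView`, and `mismatches` are hypothetical names introduced for illustration.

```scala
// Toy model only: these are NOT Spark classes. All names are hypothetical.
final case class Field(name: String, nullable: Boolean)

object NullabilityModel {
  // Schema-side view: each field keeps its declared nullability,
  // regardless of whether the enclosing struct can be null.
  def schemaView(structNullable: Boolean, declared: Seq[Field]): Seq[Field] =
    declared

  // Serializer-side view: extracting a field from an input that may be null
  // yields an expression that may be null, so nullability is widened.
  def serializerView(structNullable: Boolean, declared: Seq[Field]): Seq[Field] =
    declared.map(f => f.copy(nullable = f.nullable || structNullable))

  // Names of the fields on which the two views disagree.
  def mismatches(structNullable: Boolean, declared: Seq[Field]): Seq[String] =
    schemaView(structNullable, declared)
      .zip(serializerView(structNullable, declared))
      .collect { case (s, e) if s.nullable != e.nullable => s.name }
}
```

In this model, a nullable struct with a non-nullable field {{foo}} makes {{NullabilityModel.mismatches(true, Seq(Field("foo", false)))}} report {{foo}}, while a non-nullable struct produces no mismatches. That matches the most common failure described above, where the schema says non-nullable and the serializer says nullable.
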
I'm not sure what the simplest way to fix this is. One proposal would be to
leave the serializers unchanged and have ObjectOperator derive its output
attributes from an explicitly-passed schema rather than using the serializers'
attributes. However, I worry that this might introduce bugs in case the
serializer and schema disagree.