[
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970306#comment-15970306
]
Hyukjin Kwon commented on SPARK-14584:
--------------------------------------
[~joshrosen], it seems now it recognise the non-nullability as below:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass
scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
|-- foo: integer (nullable = false)
{code}
Could we resolve this?
> Improve recognition of non-nullability in Dataset transformations
> -----------------------------------------------------------------
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be
> null. For instance, a field in a case class with a primitive type will never
> return null. However, there are currently several cases in the Dataset API
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when
> constructing serializer expressions in ExpressionEncoders. The following
> assertion will currently fail if added to ExpressionEncoder:
> {code}
> require(schema.size == serializer.size)
> schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field
> ${field.name}'s data type is " +
> s"${field.dataType} in the schema but ${fieldSerializer.dataType} in
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field
> ${field.name}'s nullability is " +
> s"${field.nullable} in the schema but ${fieldSerializer.nullable} in
> its serializer")
> }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder
> allows for nullability, but occasionally we see a mismatch in the datatypes
> due to disagreements over the nullability of nested structs' fields (or
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a
> struct's schema we consider its fields' nullability to be independent of the
> nullability of the struct itself, whereas in the serializer expressions we
> are considering those field extraction expressions to be nullable if the
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to
> leave the serializers unchanged and have ObjectOperator derive its output
> attributes from an explicitly-passed schema rather than using the
> serializers' attributes. However, I worry that this might introduce bugs in
> case the serializer and schema disagree.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]