Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10296#issuecomment-165211474
> After some more investigation, now I don't think we can resolve SPARK-12323 by fixing NewInstance.
You could augment `NewInstance` to understand which arguments are primitive
types, or you could infer this using reflection on the constructor. However, I
think the resulting expression tree would be clearer if you instead created the
following new expression and inserted it into the tree for primitive-typed
children of a `NewInstance`:
```scala
case class AssertNotNull(path: String, child: Expression) ...
```
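For concreteness, here is a rough sketch of what such an expression could look like, using only the interpreted `eval` path and `CodegenFallback` instead of real code generation. This is just an illustration of the idea, not an actual implementation:
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// `path` is just a label used to produce a readable error message.
case class AssertNotNull(path: String, child: Expression)
  extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = child.dataType

  // The whole point of the expression: downstream consumers can treat
  // the output as non-nullable.
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(
        s"Null value appeared in non-nullable field: $path")
    }
    result
  }
}
```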
I think there may be some confusion about the schema guarantees for
encoders and their expressions. When there is a primitive type, the
corresponding `toRowExpressions` will be non-nullable, since the data is coming
from an object that can't possibly store a null value. In contrast, the
`fromRowExpression` is reading from arbitrary input data, so its
nullability has nothing to do with the structure of the target object. Instead,
this information comes from the schema that the encoder is resolved/bound
to. Therefore, using the `nullable` bit here is incorrect. `nullable` should
always be thought of as a promise about the input data that we can use for
optimization, not as a constraint-enforcement mechanism.
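To make the serialization side concrete, here is a small sketch (assuming the `ExpressionEncoder` API; the output shown in comments is what I'd expect for this shape):
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Point(x: Int, label: String)

val enc = ExpressionEncoder[Point]()
enc.schema.printTreeString()
// root
//  |-- x: integer (nullable = false)    <- primitive, can never be null
//  |-- label: string (nullable = true)  <- reference type, may be null
```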
> Another not quite appealing choice is to add an option to generate code with null checking, so that users can use it for debugging purposes.
I think it would actually be very good to have assertions that null data
does not appear where it is not expected. When we actually start using this
information, we are almost certainly going to find more places where we are not
propagating it correctly. However, we need to ensure that these assertions
are elided in production, to avoid invalidating the very optimization this
information is supposed to enable.
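One cheap way to get that elision, as a sketch: Scala's `assert` is `@elidable`, so a guard like this hypothetical helper disappears entirely when the production build is compiled with `-Xdisable-assertions`:
```scala
// Hypothetical helper: fails fast in debug builds, compiles away when
// scalac is invoked with -Xdisable-assertions (or -Xelide-below).
def checkedRead(value: AnyRef, path: String): AnyRef = {
  assert(value != null, s"Null value appeared in non-nullable field: $path")
  value
}
```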
> This analysis time checking can be done in ExpressionEncoder.resolve by comparing schemata of the logical plan and the encoder.
I don't agree that this problem can be detected statically.
Creating a schema that says `_1` is not nullable is valid, even though a
String can be null in the JVM. Again, this is a promise about the data itself,
so you can't assert that there is a problem until that promise is violated and
you actually see a null value in the column.
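For example, this (hypothetical) schema is perfectly legal to construct, and nothing can flag it as wrong until a row actually carries a null in that column:
```scala
import org.apache.spark.sql.types._

// Legal: the schema promises _1 is never null, even though a JVM String
// can be. The promise can only be checked against the actual data.
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
```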
That said, trusting the user to get this right has [led to
confusion](https://issues.apache.org/jira/browse/SPARK-11319) in the past. So
I would propose that we add validation at the `sqlContext.createDataX`
boundaries. Internally, though, we should trust this bit so that we can avoid
unnecessary and expensive null checks.
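A sketch of what such a boundary validation might look like (hypothetical helper, not an existing API):
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Hypothetical check run once per row at the createDataFrame/createDataset
// boundary: fail fast when the data violates a non-nullable promise.
def validateNullability(row: Row, schema: StructType): Unit = {
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!field.nullable && row.isNullAt(i)) {
      throw new IllegalArgumentException(
        s"Null value found in non-nullable column '${field.name}'")
    }
  }
}
```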