Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10296#issuecomment-165211474
> After some more investigation, now I don't think we can resolve SPARK-12323 by fixing NewInstance.
You could augment `NewInstance` to understand which arguments are primitive
types, or you could infer this using reflection on the constructor. However, I
think the resulting expression tree would be clearer if you instead created the
following new expression and inserted it into the tree for primitive-typed
children of a `NewInstance`:
```scala
case class AssertNotNull(path: String, child: Expression) ...
```
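For concreteness, here is a rough sketch of what such an expression could look like, using only the interpreted `eval` path and `CodegenFallback` instead of real code generation. This is just an illustration of the idea, not an actual implementation:
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// `path` is just a label used to produce a readable error message.
case class AssertNotNull(path: String, child: Expression)
  extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = child.dataType

  // The whole point of the expression: downstream consumers can treat
  // the output as non-nullable.
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(
        s"Null value appeared in non-nullable field: $path")
    }
    result
  }
}
```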
I think there may be some confusion about the schema guarantees for
encoders and their expressions. When there is a primitive type, the
corresponding `toRowExpressions` will be non-nullable, since the data is coming
from an object that can't possibly store a null value. In contrast, the
`fromRowExpression` is reading from arbitrary input data, so its
nullability has nothing to do with the structure of the target object. Instead,
this information comes from the schema that the encoder is resolved/bound
to. Therefore, using the `nullable` bit here is incorrect. `nullable` should
always be thought of as a promise about the input data that we can use for
optimization, not as a constraint-enforcement mechanism.
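To make the serialization side concrete, here is a small sketch (assuming the `ExpressionEncoder` API; the output shown in comments is what I'd expect for this shape):
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Point(x: Int, label: String)

val enc = ExpressionEncoder[Point]()
enc.schema.printTreeString()
// root
//  |-- x: integer (nullable = false)    <- primitive, can never be null
//  |-- label: string (nullable = true)  <- reference type, may be null
```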
> Another not quite appealing choice is to add an option to generate code with null checking, so that users can use it for debugging purposes.
I think it would actually be very good to have assertions that null data
does not appear where it is not expected. When we actually start using this
information, we are almost certainly going to find more places where we are not
propagating it correctly. However, we need to ensure that these assertions
are elided in production, to avoid invalidating the very optimization this
information is supposed to enable.
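One cheap way to get that elision, as a sketch: Scala's `assert` is `@elidable`, so a guard like this hypothetical helper disappears entirely when the production build is compiled with `-Xdisable-assertions`:
```scala
// Hypothetical helper: fails fast in debug builds, compiles away when
// scalac is invoked with -Xdisable-assertions (or -Xelide-below).
def checkedRead(value: AnyRef, path: String): AnyRef = {
  assert(value != null, s"Null value appeared in non-nullable field: $path")
  value
}
```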
> This analysis time checking can be done in ExpressionEncoder.resolve by comparing schemata of the logical plan and the encoder.
I don't agree that this problem can be detected statically.
Creating a schema that says `_1` is not nullable is valid, even though a
String can be null in the JVM. Again, this is a promise about the data itself,
so you can't assert that there is a problem until that promise is violated and
you actually see a null value in the column.
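For example, this (hypothetical) schema is perfectly legal to construct, and nothing can flag it as wrong until a row actually carries a null in that column:
```scala
import org.apache.spark.sql.types._

// Legal: the schema promises _1 is never null, even though a JVM String
// can be. The promise can only be checked against the actual data.
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
```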
That said, trusting the user to get this right has [led to
confusion](https://issues.apache.org/jira/browse/SPARK-11319) in the past. So
I would propose that we add validation at the `sqlContext.createDataX`
boundaries. Internally, though, we should trust this bit so that we can avoid
unnecessary and expensive null checks.
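A sketch of what such a boundary validation might look like (hypothetical helper, not an existing API):
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Hypothetical check run once per row at the createDataFrame/createDataset
// boundary: fail fast when the data violates a non-nullable promise.
def validateNullability(row: Row, schema: StructType): Unit = {
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!field.nullable && row.isNullAt(i)) {
      throw new IllegalArgumentException(
        s"Null value found in non-nullable column '${field.name}'")
    }
  }
}
```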