GitHub user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10296#issuecomment-165176419
  
    @marmbrus Reshaped this PR to only fix those nullability bugs.
    
    After some more investigation, I no longer think we can resolve SPARK-12323 
by fixing `NewInstance`. The reasons are:
    
    1.  `ExpressionEncoder`s are always created using reflection-based schema 
inference, which implies that the only non-nullable fields within a 
`fromRowExpression` are those of unboxed primitive types.
    1.  Unboxed primitive fields are always retrieved using code generated in 
`BoundReference` rather than `NewInstance`, since `NewInstance` is only used to 
build objects.
    
    Since we would like to avoid the per-row cost of runtime null checking and 
branching (what @davies and @nongli are working on), we'll have to assume that 
the nullability of the input data always matches the schema of the 
`ExpressionEncoder` being used.  Another, less appealing option is to add a 
setting that generates code with null checking, so that users can enable it for 
debugging purposes.
    
    On the other hand, we can and should ensure that the nullability of the 
underlying logical plan is consistent with the Dataset when constructing a 
Dataset. For example, the following case currently runs without any error, even 
though the data violates the declared non-nullable schema:
    
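    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import sqlContext.implicits._

    // The second row holds a null, although the schema below forbids it.
    val rowRDD = sqlContext.sparkContext.parallelize(
      Seq(Row("hello"), Row(null)))
    val schema = StructType(
      Seq(StructField("_1", StringType, nullable = false)))
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.as[Tuple1[String]].collect().foreach(println)
    
    // Output:
    //
    //   (hello)
    //   (null)
    ```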
    
    This analysis-time check can be done in `ExpressionEncoder.resolve` by 
comparing the schemata of the logical plan and the encoder. Opened PR #10331 for 
this check.
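    
    To illustrate the idea only, here is a minimal sketch of such a comparison. 
It does not use Spark's API: `Field` and `NullabilityCheck` are hypothetical 
stand-ins for a schema field and the check, and it assumes both schemata are 
flat and list their fields in the same order. It is not the code in PR #10331.
    
    ```scala
    // Hypothetical, simplified stand-in for a schema field (not Spark's StructField).
    final case class Field(name: String, nullable: Boolean)

    object NullabilityCheck {
      // Names of the fields that the plan declares nullable while the encoder
      // expects them to be non-nullable. An empty result means the two
      // schemata are consistent with respect to nullability.
      def violations(plan: Seq[Field], encoder: Seq[Field]): Seq[String] =
        plan.zip(encoder).collect {
          case (p, e) if p.nullable && !e.nullable => p.name
        }
    }
    ```
    
    With a plan field `Field("_1", nullable = true)` and an encoder field 
`Field("_1", nullable = false)`, `violations` returns `Seq("_1")`, which is the 
kind of mismatch the analyzer could report instead of failing later at runtime.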



