cloud-fan commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1257692263

I made `SparkSession.internalCreateDataFrame` public to make it easy to test the nullability mismatch bug. You can also reproduce it with a data source.
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

val rdd = sc.makeRDD(Seq(InternalRow(null)))
val df = spark.internalCreateDataFrame(rdd, new StructType().add("i", "int", false))
df.show
// The null violates the non-nullable schema and is silently read back as 0:
// +---+
// |  i|
// +---+
// |  0|
// +---+
```
   
If you use the public APIs like `SparkSession.createDataFrame`, they go through `RowEncoder`, which performs a runtime null check via the `GetExternalRowField` expression. An example:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val rdd = sc.makeRDD(Seq(Row(null)))
val df = spark.createDataFrame(rdd, new StructType().add("i", "int", false))
df.show
// java.lang.RuntimeException: Error while encoding:
// java.lang.RuntimeException: The 0th field 'i' of input row cannot be null.
```
   
I think this is the right direction: for untrusted data, add extra validation even if it has a performance overhead. If performance is critical and we don't want any overhead, we can add a flag to skip the check and trust the data. Try-catching the NPE seems a bit hacky and can't cover all the cases.
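
As a rough illustration of the kind of validation meant here, a minimal sketch against the public `RDD[Row]` API; the helper name `validateNullability` and its error message are mine for illustration, not Spark's:
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not part of Spark): fail fast if any row violates
// the schema's declared nullability, before the data reaches Spark.
def validateNullability(rows: RDD[Row], schema: StructType): RDD[Row] = {
  // Indices of the fields the schema declares non-nullable.
  val nonNullableIdx = schema.fields.zipWithIndex
    .collect { case (f, i) if !f.nullable => i }
  rows.map { row =>
    nonNullableIdx.foreach { i =>
      if (row.isNullAt(i)) {
        throw new RuntimeException(
          s"The $i-th field '${schema(i).name}' of input row cannot be null.")
      }
    }
    row
  }
}
```
A trust-the-data flag would simply skip the `map`, which is exactly the zero-overhead opt-out described above.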
   
   

