cloud-fan commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1257692263
I made `SparkSession.internalCreateDataFrame` public to easily test the nullability mismatch bug. You can also use a data source to reproduce it.

```scala
val rdd = sc.makeRDD(Seq(InternalRow(null)))
val df = spark.internalCreateDataFrame(rdd, new StructType().add("i", "int", false))
df.show
+---+
|  i|
+---+
|  0|
+---+
```

If you use the public APIs like `SparkSession.createDataFrame`, they use `RowEncoder`, which does a runtime null check via the expression `GetExternalRowField`. An example:

```scala
val rdd = sc.makeRDD(Seq(Row(null)))
val df = spark.createDataFrame(rdd, new StructType().add("i", "int", false))
df.show
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'i' of input row cannot be null.
```

I think this is the right direction to go: for untrusted data, add extra validation even if it has a perf overhead. If perf is very critical and we don't want any overhead, we can add a flag to skip this check and trust the data. Catching the NPE with try-catch seems a bit hacky and can't cover all the cases.
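To illustrate the trade-off, here is a minimal standalone sketch (not Spark's actual internals; `Field`, `validate`, and `trustData` are hypothetical names) of a runtime nullability check like the one `GetExternalRowField` performs, with a flag to skip it and trust the data:

```scala
// Hypothetical sketch: check each non-nullable field of a row at runtime,
// mirroring RowEncoder's behavior, unless the caller opts to trust the data.
case class Field(name: String, nullable: Boolean)

def validate(row: Seq[Any], schema: Seq[Field], trustData: Boolean): Seq[Any] = {
  if (!trustData) {
    schema.zipWithIndex.foreach { case (f, i) =>
      if (!f.nullable && row(i) == null)
        // Same style of error as the RowEncoder example above.
        throw new RuntimeException(
          s"The ${i}th field '${f.name}' of input row cannot be null.")
    }
  }
  row
}

val schema = Seq(Field("i", nullable = false))
// validate(Seq(null), schema, trustData = false)  // throws RuntimeException
validate(Seq(null), schema, trustData = true)      // passes through unchecked
```

With `trustData = true` the bad row flows through unchecked, which is exactly how a corrupted value like the `0` in the `internalCreateDataFrame` example can surface downstream; the default (checking) path fails fast with a clear message instead.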