xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065
Thanks for the suggestion @cloud-fan ! Good point about there many places where Spark trusts nullability. Here I am trying to target places where _user code_ could introduce a null. This includes data sources (including in-memory dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently untrustworthy, so more checks are warranted. I think any of these places where user code supplies values would be covered by this PR since they all go through projection before being accessed by other codepaths, but LMK if you disagree. I guess one situation could be if the optimizer completely removes some operator because of the nullability, so the data is never even accessed--are you thinking about situations like this? Unfortunately we cannot use `AssertNotNull` for this; the optimizer will remove it if the input schema indicates that the schema is non-null: https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826 And that optimizer rule is there for a good reason; there is nontrivial overhead associated with the null check as discussed further in [my other comment](https://github.com/apache/spark/pull/37634#discussion_r974499166). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
