xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065

   Thanks for the suggestion @cloud-fan ! Good point about there being many 
places where Spark trusts nullability. Here I am trying to target places where 
_user 
code_ could introduce a null. This includes data sources (including in-memory 
dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) 
and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently 
untrustworthy, so more checks are warranted. I think any of these places where 
user code supplies values would be covered by this PR since they all go through 
projection before being accessed by other codepaths, but LMK if you disagree. I 
guess one situation could be if the optimizer completely removes some operator 
because of the nullability, so the data is never even accessed--are you 
thinking about situations like this?
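   To make the failure mode concrete, here is a minimal plain-Scala sketch 
(not Spark code; `Field` and `consume` are hypothetical stand-ins) of how a 
consumer that trusts declared nullability lets a user-supplied null slip 
through:

```scala
object NullabilityDemo {
  // Hypothetical stand-in for a schema field whose nullability is trusted.
  final case class Field(name: String, nullable: Boolean)

  // A consumer that, trusting nullable = false, skips its null check.
  def consume(field: Field, value: String): Int =
    if (field.nullable) Option(value).map(_.length).getOrElse(-1)
    else value.length // no null check: the schema says this cannot be null

  def main(args: Array[String]): Unit = {
    val f = Field("name", nullable = false)
    // User code (a UDF, toDS(), parallelize(), ...) can still hand over
    // null even though the declared schema says non-null:
    try consume(f, null)
    catch {
      case _: NullPointerException =>
        println("user-supplied null violated the declared non-nullability")
    }
  }
}
```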
   
   Unfortunately we cannot use `AssertNotNull` for this; the optimizer will 
remove it if the input schema indicates that the value is non-null:
   
https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826
   And that optimizer rule is there for a good reason; there is nontrivial 
overhead associated with the null check as discussed further in [my other 
comment](https://github.com/apache/spark/pull/37634#discussion_r974499166).
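   The shape of that optimizer behavior can be sketched in a few lines of 
plain Scala (a simplified model, not the actual Catalyst classes): when the 
child reports `nullable = false`, the `AssertNotNull` wrapper is dropped, so 
the runtime check never executes:

```scala
object AssertNotNullDemo {
  // Simplified model of a Catalyst expression tree (hypothetical names,
  // not the real org.apache.spark.sql.catalyst classes).
  sealed trait Expr { def nullable: Boolean }
  final case class Literal(value: Any, nullable: Boolean) extends Expr
  final case class AssertNotNull(child: Expr) extends Expr {
    def nullable: Boolean = false
  }

  // Mimics the NullPropagation-style simplification: if the child claims
  // it can never produce null, the assertion is removed as a no-op.
  def simplify(e: Expr): Expr = e match {
    case AssertNotNull(child) if !child.nullable => simplify(child)
    case AssertNotNull(child)                    => AssertNotNull(simplify(child))
    case other                                   => other
  }
}
```

   So if a data source mis-reports `nullable = false`, the very guard that 
would have caught the bad null is optimized away before execution.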


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]