xkrogen commented on code in PR #38660:
URL: https://github.com/apache/spark/pull/38660#discussion_r1043895649


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala:
##########
@@ -117,6 +117,7 @@ abstract class Expression extends TreeNode[Expression] {
   lazy val deterministic: Boolean = children.forall(_.deterministic)
 
   def nullable: Boolean
+  def trustNullability: Boolean = true

Review Comment:
   `nullable` <- this is the nullability according to the schema
   `trustNullability` <- whether or not we should _trust_ that schema
   
   For internal operators, we want to "trust" the nullability of their output, 
i.e., we completely eliminate null checks for those outputs. However for 
"untrusted" sources like UDFs, we can't trust the reported nullability, so we 
still generate a null check and throw an exception if a null appears despite 
the schema reporting non-null. If designing something from scratch I might make 
nullable have three possible values: `NULLABLE`, `NONNULL`, `NONULL_UNTRUSTED` 
to indicate the three states, but it's not feasible in the current framework.
   
   I am also not that excited about having to add this new field for all 
`Expression`s, but I couldn't find another way to propagate it. The problem is 
that the null check typically happens in the operator _consuming_ the untrusted 
output. So changing the codegen for e.g. the UDF isn't sufficient; it requires 
updating the codegen for `BoundReference`, `GetStructField`, etc. And these 
consumer operators have to adjust their codegen based on the "trustworthiness" 
of their input operators. The input operators are generic `Expression` 
instances, so to propagate the trustworthiness, the only way I found was to add 
this new property on `Expression`. But I am very open to any suggestions on a 
more clean way to propagate the required information! (Also open to suggestions 
on a better name if we do keep the property ... I'm not thrilled with the 
current name.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to