JoshRosen commented on a change in pull request #24765: [SPARK-27915][SQL][WIP] 
Update logical Filter's output nullability based on IsNotNull conditions
URL: https://github.com/apache/spark/pull/24765#discussion_r289619399
 
 

 ##########
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ##########
 @@ -87,28 +87,25 @@ case class ProjectExec(projectList: Seq[NamedExpression], 
child: SparkPlan)
 case class FilterExec(condition: Expression, child: SparkPlan)
   extends UnaryExecNode with CodegenSupport with PredicateHelper {
 
+  // Mark this as empty. We'll evaluate the input during doConsume(). We don't 
want to evaluate
+  // all the variables at the beginning to take advantage of short circuiting.
+  override def usedInputs: AttributeSet = AttributeSet.empty
+
   // Split out all the IsNotNulls from condition.
   private val (notNullPreds, otherPreds) = 
splitConjunctivePredicates(condition).partition {
 
 Review comment:
   I found the old code here to be slightly confusing because it seemed to be 
using `notNullPreds` for two different purposes:
   
   1. If we see `IsNotNull` conjuncts in the filter then evaluate them first / 
earlier because (a) these expressions are cheap to evaluate and may allow for 
short-circuiting and skipping more expensive expressions, and (b) evaluating 
these earlier allows other expressions to omit null checks (for example, if we 
have `IsNotNull(x)` and `x * 100 < 10` then we _already_ implicitly need to 
null-check `x` as part of the second expression so we might as well do the 
explicit null check expression first).
   2. Given that tuples have successfully passed through the filter, we can 
rely on the presence of `IsNotNull` checks to default subsequent expressions' 
null checks to `false`. For example, let's say we had a `.filter().select()` 
which gets compiled into a single whole stage codegen: after tuples have passed 
through the filter we know that certain fields cannot possibly be null, so we 
can elide null checks _at codegen time_ by just setting `nullable = false` in 
subsequent code.
   
   There might be some subtleties related in (1) related to non-deterministic 
expressions, but I think that's accounted for further down at the place where 
we're actually generating the checks.
   
   In the old code, the `(notNullPreds, otherPreds)` on this line was being 
used for both purposes: for (1) I think we could simply collect _all_ 
`IsNotNull` expressions, but the existing implementation of (2) relied on the 
additional `nullIntolerant` / `a.references` checks in order to be correct.
   
   In this PR, I've separated these two usages: the "update nullability for 
downstream operators" now uses the more precise condition implemented in 
`getImpliedNotNullExprIds`, while the "optimize short-circuiting" simply checks 
for `IsNotNull` and ignores child attributes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to