JoshRosen opened a new pull request #24765: [SPARK-27915][SQL][WIP] Update 
logical Filter's output nullability based on IsNotNull conditions
URL: https://github.com/apache/spark/pull/24765
 
 
   ## What changes were proposed in this pull request?
   
   This PR changes the logical `Filter` operator to update its outputs' 
nullability when filter conditions imply that outputs cannot be null. 
   
   In addition, I refined similar existing logic in the physical `FilterExec` 
(changing the existing code to be more precise / less conservative in its 
non-nullability inference) and improved propagation of inferred nullability 
information in `Project`.
   
   This is useful because of how it composes with other optimizations: Spark 
has several logical and physical optimizations which leverage non-nullability, 
so improving nullability inference increases the value of those existing 
optimizations.
   
   :warning: **Disclaimers** :warning:
   
    - This is a work-in-progress / skunkworks side project; I'm not working on 
this full time.
    - Nullability has been a major source of bugs in the past: this PR requires 
careful review.
    - I haven't run analyzer / optimizer performance benchmarks, so there's a 
decent chance that this WIP changeset regresses query planning performance.
    - DataFrames / Datasets / queries' `.schema` may change as a result of this 
optimization: this may have consequences in case nullability information is 
used by downstream systems (e.g. for `CREATE TABLE` DDL).
    - The schemas of analyzed and optimized logical plans may now differ in 
terms of field nullability (because optimization might infer additional 
constraints which allow us to prove that fields are non-null).
    
   ### Examples
   
   Consider the query
   
   ```
   SELECT key
   FROM t
   WHERE key IS NOT NULL
   ```
   
   where `t.key` is nullable.
   
   Because of the `key IS NOT NULL` filter condition, `key` will always be 
non-null. Prior to this patch, this query's result schema was 
overly-conservative, continuing to mark  `key` as nullable. However, if we take 
advantage of the `key IS NOT NULL` condition we can set `nullable = false` for 
`key`.
   
   This was a somewhat trivial example, so let's look at some more complicated 
cases:
   
   Consider 
   
   ```
   SELECT A.key, A.value
   FROM A, B
   WHERE
       A.key = B.key AND
       (A.num + B.num) > 0
   ```
   
   where all columns of `A` and `B` are nullable. Because of the equality join 
condition, we know that `key` must be non-null in both tables. In addition, the 
condition `(A.num + B.num) > 0` can only hold if both `num` values are not 
null: addition is a _null-intolerant_ operator, meaning that it returns `null` 
if any of its operands is null.
   
   Leveraging this, we should be able to mark both `key` and `value` as 
non-null in the join result's schema (even though both values are nullable in 
the underlying input relation).
   
   Finally, let's look at an example of a **non** null-intolerant operator: 
`coalesce(a, b) IS NOT NULL` could still mean that `a` or `b` is null, so in
   
   ```
   SELECT key, foo, COALESCE(foo, bar) as qux
   FROM A
   WHERE COALESCE(foo, bar) > 0
   ```
   
   we can infer that `qux` is not null but cannot make any claims about 
`value`'s nullability.
   
   
   ### Description of changes
   
   - Introduce `PredicateHelper.getImpliedNotNullExprIds(IsNotNull)` helper, 
which takes an `IsNotNull` expression and returns the `ExprId`s of expressions 
which cannot be null. This handles simple cases like 
`IsNotNull(columnFromTable)`, as well as more complex cases involving 
expression trees (properly accounting for null-(in)tolerance).
     - There was similar existing logic in `FilterExec`, but I think it was 
overly conservative: given `IsNotNull(x)`, it would claim that `x` _and all of 
its descendants_ were not null if and only if every ancestor of `x` was 
`NullIntolerant`. However, even if `x` is null-tolerant we can still make 
claims about `x`'s non-nullability even if we can't make further claims about 
its children.
   - Update `logical.Filter` to leverage this new function to update output 
nullability.
   - Modify `FilterExec` to re-use this logic. This part is a bit tricky 
because the `FilterExec` code looks at `IsNotNull` expressions both for 
optimizing the order of expression evaluation _and_ for refining nullability to 
elide null checks in downstream operators.
   - Modify `logical.Project` so that inferred non-nullability information from 
child operators is preserved.
    
   ### Background on related historical changes / bugs
   
   While developing this patch, I found the following historical PRs to be 
useful references (note: many of these original PRs contained correctness bugs 
which were subsequently fixed in later PRs):
   
   - #10844 first introduced constraint inference and propagation in Spark SQL 
(including basic extraction of `IsNotNull` conditions from expressions).
   - #11585 modified `FilterExec` to update output nullability based on 
`IsNotNull` conditions.
   - #11792 modified `FilterExec` to add special handling for `IsNotNull` 
expression codegen, altering evaluation order to allow for better 
short-circuiting.
   - #11809 introduced the `NullIntolerant` trait to generalize the `IsNotNull` 
extraction logic.
   - #11810 updates `FilterExec`'s `IsNotNull`-handling path to use 
`NullIntolerant`
   - #15523 fixed a bug in `FilterExec`'s logic: it did not properly account 
for null-tolerant operators which were ancestors of `IsNotNull` expressions.
   - #16067 fixed a bug related to negation and `IsNotNull`.
   
   
   
   ## How was this patch tested?
   
   Added new tests for the added `PredicateHelper.getImpliedNotNullExprIds`.
   
   **TODO**: add new end-to-end tests reflecting the examples listed above (in 
order to properly test the integration of this new logic into `logical.Filter` 
and `logical.Project`).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to