JoshRosen opened a new pull request #24765: [SPARK-27915][SQL][WIP] Update logical Filter's output nullability based on IsNotNull conditions
URL: https://github.com/apache/spark/pull/24765

## What changes were proposed in this pull request?

This PR changes the logical `Filter` operator to update its outputs' nullability when filter conditions imply that outputs cannot be null. In addition, I refined similar existing logic in the physical `FilterExec` (changing the existing code to be more precise / less conservative in its non-nullability inference) and improved propagation of inferred nullability information in `Project`.

This is useful because of how it composes with other optimizations: Spark has several logical and physical optimizations which leverage non-nullability, so improving nullability inference increases the value of those existing optimizations.

:warning: **Disclaimers** :warning:

- This is a work-in-progress / skunkworks side project; I'm not working on this full time.
- Nullability has been a major source of bugs in the past: this PR requires careful review.
- I haven't run analyzer / optimizer performance benchmarks, so there's a decent chance that this WIP changeset regresses query planning performance.
- DataFrames' / Datasets' / queries' `.schema` may change as a result of this optimization: this may have consequences if nullability information is used by downstream systems (e.g. for `CREATE TABLE` DDL).
- The schemas of analyzed and optimized logical plans may now differ in terms of field nullability (because optimization might infer additional constraints which allow us to prove that fields are non-null).

### Examples

Consider the query

```
SELECT key FROM t WHERE key IS NOT NULL
```

where `t.key` is nullable. Because of the `key IS NOT NULL` filter condition, `key` will always be non-null. Prior to this patch, this query's result schema was overly conservative, continuing to mark `key` as nullable.
However, if we take advantage of the `key IS NOT NULL` condition, we can set `nullable = false` for `key`.

This was a somewhat trivial example, so let's look at some more complicated cases. Consider

```
SELECT A.key, A.value FROM A, B WHERE A.key = B.key AND (A.num + B.num) > 0
```

where all columns of `A` and `B` are nullable. Because of the equality join condition, we know that `key` must be non-null in both tables. In addition, the condition `(A.num + B.num) > 0` can only hold if both `num` values are non-null: addition is a _null-intolerant_ operator, meaning that it returns `null` if any of its operands is null. Leveraging this, we should be able to mark `key` as non-null in the join result's schema (even though it is nullable in the underlying input relations).

Finally, let's look at an example of a null-_tolerant_ operator: `coalesce(a, b) IS NOT NULL` can hold even when `a` or `b` is null, so in

```
SELECT key, foo, COALESCE(foo, bar) AS qux FROM A WHERE COALESCE(foo, bar) > 0
```

we can infer that `qux` is non-null but cannot make any claims about `foo`'s nullability.

### Description of changes

- Introduce a `PredicateHelper.getImpliedNotNullExprIds(IsNotNull)` helper, which takes an `IsNotNull` expression and returns the `ExprId`s of expressions which cannot be null. This handles simple cases like `IsNotNull(columnFromTable)`, as well as more complex cases involving expression trees (properly accounting for null-(in)tolerance).
- There was similar existing logic in `FilterExec`, but I think it was overly conservative: given `IsNotNull(x)`, it would claim that `x` _and all of its descendants_ were not null if and only if every ancestor of `x` was `NullIntolerant`. However, even if `x` is null-tolerant we can still make claims about `x`'s own non-nullability, even though we can't make further claims about its children.
- Update `logical.Filter` to leverage this new function to update output nullability.
- Modify `FilterExec` to re-use this logic. This part is a bit tricky because the `FilterExec` code looks at `IsNotNull` expressions both for optimizing the order of expression evaluation _and_ for refining nullability to elide null checks in downstream operators.
- Modify `logical.Project` so that inferred non-nullability information from child operators is preserved.

### Background on related historical changes / bugs

While developing this patch, I found the following historical PRs to be useful references (note: many of these original PRs contained correctness bugs which were subsequently fixed in later PRs):

- #10844 first introduced constraint inference and propagation in Spark SQL (including basic extraction of `IsNotNull` conditions from expressions).
- #11585 modified `FilterExec` to update output nullability based on `IsNotNull` conditions.
- #11792 modified `FilterExec` to add special handling for `IsNotNull` expression codegen, altering evaluation order to allow for better short-circuiting.
- #11809 introduced the `NullIntolerant` trait to generalize the `IsNotNull` extraction logic.
- #11810 updated `FilterExec`'s `IsNotNull`-handling path to use `NullIntolerant`.
- #15523 fixed a bug in `FilterExec`'s logic: it did not properly account for null-tolerant operators which were ancestors of `IsNotNull` expressions.
- #16067 fixed a bug related to negation and `IsNotNull`.

## How was this patch tested?

Added new tests for the added `PredicateHelper.getImpliedNotNullExprIds` helper.

**TODO**: add new end-to-end tests reflecting the examples listed above (in order to properly test the integration of this new logic into `logical.Filter` and `logical.Project`).
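The null-intolerance rule at the heart of this change can be illustrated with a small standalone sketch. The classes and helper below are simplified, hypothetical stand-ins for illustration only (not Spark's actual Catalyst `Expression` API or the `getImpliedNotNullExprIds` implementation): the idea is that an expression known to be non-null is itself non-null, but we may recurse into its children only when the operator is null-intolerant.

```python
# A minimal model of null-intolerance-aware non-null inference.
# All names here are illustrative stand-ins, not Spark classes.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Expr:
    children: List["Expr"] = field(default_factory=list)
    null_intolerant: bool = False


@dataclass
class Column(Expr):
    name: str = ""


def column(name: str) -> Expr:
    return Column(name=name)

def add(left: Expr, right: Expr) -> Expr:
    # x + y is null whenever either side is null => null-intolerant.
    return Expr(children=[left, right], null_intolerant=True)

def coalesce(*args: Expr) -> Expr:
    # coalesce can be non-null even when some inputs are null => null-tolerant.
    return Expr(children=list(args))

def greater_than_zero(e: Expr) -> Expr:
    # e > 0 can only be true if e is non-null => null-intolerant.
    return Expr(children=[e], null_intolerant=True)


def implied_non_null_columns(e: Expr) -> Set[str]:
    """Given an expression known to evaluate to a non-null value (e.g. the
    child of IsNotNull, or a filter condition that held true), return the
    columns that must therefore be non-null. The expression itself is always
    non-null; we only descend into children of null-intolerant operators."""
    here = {e.name} if isinstance(e, Column) else set()
    below: Set[str] = set()
    if e.null_intolerant:
        for child in e.children:
            below |= implied_non_null_columns(child)
    return here | below


# (A.num + B.num) > 0 implies both num columns are non-null.
cond = greater_than_zero(add(column("A.num"), column("B.num")))
print(implied_non_null_columns(cond))   # {'A.num', 'B.num'}

# COALESCE(foo, bar) > 0 says nothing about foo or bar individually.
cond2 = greater_than_zero(coalesce(column("foo"), column("bar")))
print(implied_non_null_columns(cond2))  # set()
```

Note that the second case still lets us mark the `COALESCE(foo, bar)` expression itself as non-null; the recursion simply stops there instead of descending into the null-tolerant operator's children.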
