kosiew opened a new issue, #22868: URL: https://github.com/apache/datafusion/issues/22868
## Summary `datafusion/optimizer/src/eliminate_outer_join.rs` [currently](https://github.com/apache/datafusion/tree/d77a02d662b27852cc247153d9edcad5a92905d4) tracks null-rejection evidence as `Vec<Column>` and later reduces those columns to two booleans: whether the predicate rejects NULLs from the join's left side, right side, or both. This is more indirect than the optimizer needs. The join-elimination decision does not depend on which specific column was null-rejecting; it only depends on which join side was null-rejecting. Refactor the analysis to represent that invariant directly. ## Current code In `try_simplify_join`: - `null_rejecting_cols: Vec<Column>` is built. - `extract_null_rejecting_columns(...)` appends column references into that vector. - The caller loops over those columns and calls `join.left.schema().has_column(col)` / `join.right.schema().has_column(col)` to derive `left_non_nullable` and `right_non_nullable`. - `eliminate_outer(...)` only consumes those side-level booleans. In `extract_null_rejecting_columns`: - `Expr::Column` pushes the column into the vector. - Top-level `AND` appends evidence from both branches. - `OR` and nested `AND` compute branch-local column vectors, then push one representative column for each side where both branches reject NULLs. The current logic works, but the data model is column-driven while the decision is side-driven. ## Refactor goal Replace the intermediate `Vec<Column>` contract with explicit side-level state, for example: ```rust struct NullRejectingSides { left: bool, right: bool, } ``` The helper should return or update `NullRejectingSides` directly. `try_simplify_join` should then pass those booleans to `eliminate_outer(...)` without re-scanning collected columns. ## Why this is a good refactor - **Clearer invariant:** code states directly that outer-join elimination depends on null-rejection of join sides, not individual column identities. - **Less indirection:** removes the collect-columns-then-reclassify-columns step. - **Less repeated work:** avoids repeated `has_column` scans after extraction. - **Clearer OR handling:** branch combination can be expressed as side-level intersection instead of selecting representative columns. - **Easier maintenance:** future changes can reason about `left` / `right` null-rejection directly. ## Required behavior preservation This must be a behavior-preserving cleanup. Do not change SQL-visible semantics or expected optimized plans. Preserve these rules: - Top-level `AND` in a `WHERE` predicate unions null-rejection evidence from both conjuncts. - `OR` credits a side only when both branches independently reject NULLs from that side. - Nested `AND` follows the same side-intersection behavior as `OR`, matching current conservative handling. - `IS NOT NULL`, `IS TRUE`, `IS FALSE`, and `IS NOT UNKNOWN` only recurse when allowed by the current `top_level` guard. - Operators using `returns_null_on_null()` continue to recurse as they do today. - `IS NULL`, `IS NOT TRUE`, `IS NOT FALSE`, `IS UNKNOWN`, function calls, literals, subqueries, and other currently skipped expressions remain skipped. - Qualified or ambiguous column handling remains equivalent to the current `DFSchema::has_column`-based side-membership checks. ## Suggested implementation shape Possible helper methods: ```rust impl NullRejectingSides { fn union(self, other: Self) -> Self { ... } fn intersection(self, other: Self) -> Self { ... } } ``` Possible extraction flow: - `Expr::Column(col)` returns side membership based on `left_schema.has_column(col)` and `right_schema.has_column(col)`. - Top-level `AND` returns `left_sides.union(right_sides)`. - `OR` and nested `AND` return `left_sides.intersection(right_sides)`. - Null-propagating operators recurse and union child evidence as current column collection does. Keep the refactor local to `eliminate_outer_join.rs` unless a small helper extraction clearly improves readability. ## Non-goals - No optimizer behavior change. - No new outer-join elimination cases. - No changes outside `eliminate_outer_join` unless mechanically necessary. - No public API changes. ## Acceptance criteria - `try_simplify_join` no longer builds or consumes `Vec<Column>` for null-rejection analysis. - Side-level null-rejection state is explicit in the code. - Existing eliminate_outer_join behavior and expected plans are unchanged. - OR / nested-AND side-intersection behavior remains intact. - Current top-level guard behavior remains intact. - Code is simpler to read than the current collect-and-reduce flow. ## Test strategy Run the existing focused optimizer tests: ```bash cargo test -p datafusion-optimizer eliminate_outer_join --lib ``` Ensure existing coverage still exercises: - LEFT / RIGHT / FULL outer join conversion under null-rejecting predicates. - No-conversion cases under null-accepting predicates. - OR predicates that reject one side or both sides. - Top-level guard behavior for `IS NOT NULL`, `IS TRUE`, `IS FALSE`, and `IS NOT UNKNOWN`. If the helper becomes easy to test directly, optionally add focused unit tests for side-level union/intersection behavior. ## Impact This refactor should reduce conceptual complexity without changing optimizer output. It encodes the true invariant directly: outer-join elimination is driven by whether the filter rejects null-padded rows from the left side, the right side, or both. ## Related PR #22444. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
