[
https://issues.apache.org/jira/browse/SPARK-56932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56932:
-----------------------------------
Labels: pull-request-available (was: )
> Rewrite top-level single-column NOT IN subqueries to expose equi anti joins
> ---------------------------------------------------------------------------
>
> Key: SPARK-56932
> URL: https://issues.apache.org/jira/browse/SPARK-56932
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 5.0.0
> Reporter: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> Spark currently plans many top-level single-column NOT IN subqueries with a
> null-aware anti join shape that can fall back to a BroadcastNestedLoopJoin.
> This can be much more expensive than a regular equi anti join, especially for
> larger inputs.
> For a supported top-level predicate of the form:
> {code:sql}
> lhs NOT IN (SELECT rhs FROM ...)
> {code}
> the SQL semantics can be decomposed into two disjoint cases:
> 1. The RHS is empty.
> - NOT IN evaluates to TRUE for every row.
> 2. The RHS is non-empty.
> - The result is TRUE only when:
> - lhs IS NOT NULL
> - the RHS contains no NULL
> - there is no equal RHS value for lhs
> This means the predicate can be rewritten as a UNION of:
> - an empty-RHS branch, and
> - a regular equi left-anti join branch with explicit null checks.
> Conceptually:
> {code:sql}
> SELECT lhs_rows
> WHERE NOT EXISTS (SELECT 1 FROM rhs)
> UNION ALL
> SELECT lhs_rows
> WHERE lhs IS NOT NULL
> AND EXISTS (SELECT 1 FROM rhs)
> AND NOT EXISTS (SELECT 1 FROM rhs WHERE rhs IS NULL)
> AND NOT EXISTS (SELECT 1 FROM rhs WHERE lhs = rhs)
> {code}
> The second branch exposes a standard equality anti join, which gives the
> optimizer a much better plan shape than a nested-loop anti join.
> This optimization should be conservative and only apply when it is
> semantically safe, for example:
> - top-level filter predicate only
> - single-column NOT IN
> - uncorrelated subquery
> - deterministic left-hand expression
> - deterministic duplicated RHS subplan
> - no LIMIT, OFFSET, or similar RHS constructs whose duplication could change
> semantics
> The rewrite should remain behind a SQL configuration flag initially.
> Motivation / Example:
> For a query like:
> {code:sql}
> SELECT *
> FROM fact_events e
> WHERE e.user_id NOT IN (
> SELECT blocked_user_id
> FROM blocked_users
> )
> {code}
> Spark may otherwise choose a null-aware nested-loop anti join shape.
> After the rewrite, the non-empty-RHS branch can use an equi anti join on:
> {code:sql}
> e.user_id = blocked_user_id
> {code}
> while preserving correct NOT IN null semantics through the explicit
> null-presence checks.
> This should improve plan quality for affected queries while preserving
> correctness.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]