Chao Sun created SPARK-56932:
--------------------------------
Summary: Rewrite top-level single-column NOT IN subqueries to
expose equi anti joins
Key: SPARK-56932
URL: https://issues.apache.org/jira/browse/SPARK-56932
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 5.0.0
Reporter: Chao Sun
Spark currently plans many top-level single-column NOT IN subqueries with a
null-aware anti join shape that can fall back to a BroadcastNestedLoopJoin.
This can be much more expensive than a regular equi anti join, especially for
larger inputs.
For a supported top-level predicate of the form:
{code:sql}
lhs NOT IN (SELECT rhs FROM ...)
{code}
the SQL semantics can be decomposed into two disjoint cases:
1. The RHS is empty.
- NOT IN evaluates to TRUE for every row.
2. The RHS is non-empty.
- The result is TRUE only when:
- lhs IS NOT NULL
- the RHS contains no NULL
- there is no equal RHS value for lhs
This means the predicate can be rewritten as a UNION of:
- an empty-RHS branch, and
- a regular equi left-anti join branch with explicit null checks.
Conceptually:
{code:sql}
SELECT lhs_rows
WHERE NOT EXISTS (SELECT 1 FROM rhs)
UNION ALL
SELECT lhs_rows
WHERE lhs IS NOT NULL
AND EXISTS (SELECT 1 FROM rhs)
AND NOT EXISTS (SELECT 1 FROM rhs WHERE rhs IS NULL)
AND NOT EXISTS (SELECT 1 FROM rhs WHERE lhs = rhs)
{code}
The second branch exposes a standard equality anti join, which gives the
optimizer a much better plan shape than a nested-loop anti join.
This optimization should be conservative and only apply when it is semantically
safe, for example:
- top-level filter predicate only
- single-column NOT IN
- uncorrelated subquery
- deterministic left-hand expression
- deterministic duplicated RHS subplan
- no LIMIT, OFFSET, or similar RHS constructs whose duplication could change
semantics
The rewrite should remain behind a SQL configuration flag initially.
Motivation / Example:
For a query like:
{code:sql}
SELECT *
FROM fact_events e
WHERE e.user_id NOT IN (
SELECT blocked_user_id
FROM blocked_users
)
{code}
Spark may otherwise choose a null-aware nested-loop anti join shape.
After the rewrite, the non-empty-RHS branch can use an equi anti join on:
{code:sql}
e.user_id = blocked_user_id
{code}
while preserving correct NOT IN null semantics through the explicit
null-presence checks.
This should improve plan quality for affected queries while preserving
correctness.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]