Chao Sun created SPARK-56932:
--------------------------------

             Summary: Rewrite top-level single-column NOT IN subqueries to 
expose equi anti joins
                 Key: SPARK-56932
                 URL: https://issues.apache.org/jira/browse/SPARK-56932
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 5.0.0
            Reporter: Chao Sun


Spark currently plans many top-level single-column NOT IN subqueries with a 
null-aware anti join shape that can fall back to a BroadcastNestedLoopJoin. 
This can be much more expensive than a regular equi anti join, especially for 
larger inputs.

For a supported top-level predicate of the form:


{code:sql}
  lhs NOT IN (SELECT rhs FROM ...)
{code}

the SQL semantics can be decomposed into two disjoint cases:

1. The RHS is empty.
   - NOT IN evaluates to TRUE for every row.

2. The RHS is non-empty.
   - The result is TRUE only when:
     - lhs IS NOT NULL
     - the RHS contains no NULL
     - there is no equal RHS value for lhs

This means the predicate can be rewritten as a UNION of:

- an empty-RHS branch, and
- a regular equi left-anti join branch with explicit null checks.

Conceptually:


{code:sql}
  SELECT lhs_rows
  WHERE NOT EXISTS (SELECT 1 FROM rhs)

  UNION ALL

  SELECT lhs_rows
  WHERE lhs IS NOT NULL
    AND EXISTS (SELECT 1 FROM rhs)
    AND NOT EXISTS (SELECT 1 FROM rhs WHERE rhs IS NULL)
    AND NOT EXISTS (SELECT 1 FROM rhs WHERE lhs = rhs)

{code}

The second branch exposes a standard equality anti join, which gives the 
optimizer a much better plan shape than a nested-loop anti join.

This optimization should be conservative and only apply when it is semantically 
safe, for example:

- top-level filter predicate only
- single-column NOT IN
- uncorrelated subquery
- deterministic left-hand expression
- deterministic duplicated RHS subplan
- no LIMIT, OFFSET, or similar RHS constructs whose duplication could change 
semantics

The rewrite should remain behind a SQL configuration flag initially.

Motivation / Example:
For a query like:


{code:sql}
  SELECT *
  FROM fact_events e
  WHERE e.user_id NOT IN (
    SELECT blocked_user_id
    FROM blocked_users
  )
{code}

Spark may otherwise choose a null-aware nested-loop anti join shape.

After the rewrite, the non-empty-RHS branch can use an equi anti join on:


{code:sql}
  e.user_id = blocked_user_id

{code}

while preserving correct NOT IN null semantics through the explicit 
null-presence checks.

This should improve plan quality for affected queries while preserving 
correctness.








--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to