[ 
https://issues.apache.org/jira/browse/SPARK-56932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56932:
-----------------------------------
    Labels: pull-request-available  (was: )

> Rewrite top-level single-column NOT IN subqueries to expose equi anti joins
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-56932
>                 URL: https://issues.apache.org/jira/browse/SPARK-56932
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 5.0.0
>            Reporter: Chao Sun
>            Priority: Major
>              Labels: pull-request-available
>
> Spark currently plans many top-level single-column NOT IN subqueries with a 
> null-aware anti join shape that can fall back to a BroadcastNestedLoopJoin. 
> This can be much more expensive than a regular equi anti join, especially for 
> larger inputs.
> For a supported top-level predicate of the form:
> {code:sql}
>   lhs NOT IN (SELECT rhs FROM ...)
> {code}
> the SQL semantics can be decomposed into two disjoint cases:
> 1. The RHS is empty.
>    - NOT IN evaluates to TRUE for every row.
> 2. The RHS is non-empty.
>    - The result is TRUE only when:
>      - lhs IS NOT NULL
>      - the RHS contains no NULL
>      - there is no equal RHS value for lhs
> This means the predicate can be rewritten as a UNION of:
> - an empty-RHS branch, and
> - a regular equi left-anti join branch with explicit null checks.
> Conceptually:
> {code:sql}
>   SELECT lhs_rows
>   WHERE NOT EXISTS (SELECT 1 FROM rhs)
>   UNION ALL
>   SELECT lhs_rows
>   WHERE lhs IS NOT NULL
>     AND EXISTS (SELECT 1 FROM rhs)
>     AND NOT EXISTS (SELECT 1 FROM rhs WHERE rhs IS NULL)
>     AND NOT EXISTS (SELECT 1 FROM rhs WHERE lhs = rhs)
> {code}
> The second branch exposes a standard equality anti join, which gives the 
> optimizer a much better plan shape than a nested-loop anti join.
> This optimization should be conservative and only apply when it is 
> semantically safe, for example:
> - top-level filter predicate only
> - single-column NOT IN
> - uncorrelated subquery
> - deterministic left-hand expression
> - deterministic duplicated RHS subplan
> - no LIMIT, OFFSET, or similar RHS constructs whose duplication could change 
> semantics
> The rewrite should remain behind a SQL configuration flag initially.
> Motivation / Example:
> For a query like:
> {code:sql}
>   SELECT *
>   FROM fact_events e
>   WHERE e.user_id NOT IN (
>     SELECT blocked_user_id
>     FROM blocked_users
>   )
> {code}
> Spark may otherwise choose a null-aware nested-loop anti join shape.
> After the rewrite, the non-empty-RHS branch can use an equi anti join on:
> {code:sql}
>   e.user_id = blocked_user_id
> {code}
> while preserving correct NOT IN null semantics through the explicit 
> null-presence checks.
> This should improve plan quality for affected queries while preserving 
> correctness.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to