Chao Sun created SPARK-57282:
--------------------------------

             Summary: Spread NULL left anti join keys across shuffle partitions
                 Key: SPARK-57282
                 URL: https://issues.apache.org/jira/browse/SPARK-57282
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Chao Sun
            Assignee: Chao Sun


SPARK-56903 added feature-flagged NULL-key spreading for shuffled outer 
equi-joins. The review discussion explicitly identified ordinary LEFT ANTI 
joins as a follow-up because the same structural condition applies: left-side 
rows with NULL keys cannot match under `=`, but they must be emitted by the 
anti join. Hash partitioning currently sends all of those rows to one reducer, 
causing avoidable skew on NULL-heavy inputs.

Extend `spark.sql.shuffle.spreadNullJoinKeys.enabled` to ordinary shuffled LEFT 
ANTI equi-joins when the preserved left-side join keys are nullable. Non-NULL 
keys should retain normal hash partitioning, while NULL-keyed rows can use the 
existing null-aware shuffle layout. LEFT SEMI and other existence joins remain 
out of scope because they do not emit unmatched left rows.

Add correctness and physical-partitioning coverage for sort-merge join and 
shuffled hash join, plus AQE coverage for coalesced null-aware partitioning.

Related discussion: 
https://github.com/apache/spark/pull/55927#discussion_r3261128231



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to