sunchao opened a new pull request, #56348: URL: https://github.com/apache/spark/pull/56348
### What changes were proposed in this pull request?

Extend `spark.sql.shuffle.spreadNullJoinKeys.enabled` to shuffled `LEFT ANTI` equi-joins when the preserved left-side join keys are nullable.

The planner requests the existing null-aware clustered distribution for eligible left anti joins. Non-NULL keys retain normal hash placement, while NULL keys may be spread across shuffle partitions.

This PR also updates the configuration documentation. The tests cover sort-merge and shuffled-hash left anti joins, including result correctness and null-aware shuffle partitioning, plus AQE coalescing of the resulting partitioning.

This follows the `LEFT ANTI` discussion in https://github.com/apache/spark/pull/55927.

### Why are the changes needed?

For an ordinary `LEFT ANTI` equi-join, rows with NULL keys on the preserved left side cannot match and must be emitted. Standard hash partitioning sends all of those rows to the same reducer, which can create severe shuffle skew. Spreading the NULL-keyed rows only changes their physical placement and therefore reduces this skew without changing join results.

### Does this PR introduce _any_ user-facing change?

Yes, but only when `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled. Eligible shuffled left anti joins may spread NULL-keyed preserved rows across shuffle partitions. Query results are unchanged, and the configuration remains disabled by default.

### How was this patch tested?
- `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.ExistenceJoinSuite"` (118 tests passed)
- `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./build/sbt "sql/testOnly org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z 'SPARK-57282: spread NULL keys for left anti join'"` (1 test passed)
- `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./dev/lint-scala`
- `git diff --check`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5
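
### Illustration: NULL-key placement in a shuffled left anti join

The skew argument in this description can be checked with a small toy model. The sketch below is not Spark code and does not reflect how the planner or the null-aware clustered distribution is implemented. It is a plain-Python simulation under stated assumptions: partitions are lists, `hash(key) % n` stands in for hash placement, NULL is modeled as `None`, only the left side is spread, and all function names (`hash_partition`, `spread_partition`, `shuffled_left_anti`) are hypothetical.

```python
import random
from collections import Counter


def hash_partition(key, n):
    """Plain hash placement: every NULL (None) key lands in partition 0."""
    return 0 if key is None else hash(key) % n


def spread_partition(key, n, rng):
    """Null-spreading placement (toy model): non-NULL keys keep their hash
    placement, NULL keys are scattered over all partitions."""
    return rng.randrange(n) if key is None else hash(key) % n


def shuffled_left_anti(left, right, n, left_partitioner):
    """Shuffle both sides into n partitions, then anti-join per partition.

    The right side always uses plain hash placement (an assumption of this
    sketch). Returns the emitted left keys and the left partition sizes.
    """
    left_parts = [[] for _ in range(n)]
    right_parts = [set() for _ in range(n)]
    for key in left:
        left_parts[left_partitioner(key, n)].append(key)
    for key in right:
        right_parts[hash_partition(key, n)].add(key)

    output = []
    for lp, rp in zip(left_parts, right_parts):
        # A NULL left key never equals anything, so the row is always emitted.
        output.extend(k for k in lp if k is None or k not in rp)
    return output, [len(p) for p in left_parts]


rng = random.Random(0)
n = 8
left = [None] * 1000 + list(range(100))   # heavily NULL-skewed left side
right = list(range(0, 100, 2)) + [None]   # even keys plus one NULL

plain_out, plain_sizes = shuffled_left_anti(left, right, n, hash_partition)
spread_out, spread_sizes = shuffled_left_anti(
    left, right, n, lambda k, m: spread_partition(k, m, rng)
)

print("plain  partition sizes:", plain_sizes)
print("spread partition sizes:", spread_sizes)
print("same result multiset:", Counter(plain_out) == Counter(spread_out))
```

Because a NULL-keyed left row is emitted regardless of which partition it lands in, scattering those rows changes only the partition sizes, not the multiset of output rows. Non-NULL keys must stay co-located with their potential matches, which is why they keep hash placement in both partitioners.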
