callmepandey opened a new pull request, #53670: URL: https://github.com/apache/spark/pull/53670
### What changes were proposed in this pull request?

This PR fixes an issue where null-aware anti-joins (enabled via `spark.sql.optimizeNullAwareAntiJoin`) were unconditionally using `BroadcastHashJoinExec` without checking whether the right side was small enough to broadcast according to `spark.sql.autoBroadcastJoinThreshold`.

### Why are the changes needed?

When `spark.sql.optimizeNullAwareAntiJoin` is enabled, queries using `NOT IN` with a subquery would always attempt to broadcast the right side, even when it exceeded the broadcast threshold. This could lead to out-of-memory (OOM) errors with large datasets.

### Does this PR introduce any user-facing change?

Yes. When `spark.sql.autoBroadcastJoinThreshold` is set to -1 (or a small value), null-aware anti-joins will now respect this configuration and fall back to alternative join strategies instead of attempting to broadcast large tables.

### How was this patch tested?

Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold" in JoinSuite that verifies null-aware anti-joins do not use `BroadcastHashJoinExec` when broadcast is disabled.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
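The test described above could look roughly like the following sketch. This is an illustrative reconstruction, not the PR's actual test body: the table names `testData` and `testData2` and the exact assertion are assumptions based on common JoinSuite conventions.

```scala
// Hypothetical sketch of the described regression test (not the PR's code).
// Assumes the standard JoinSuite fixtures testData/testData2 and the SQLConf
// entries for the two configs named in the PR description.
test("SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold") {
  withSQLConf(
      SQLConf.OPTIMIZE_NULL_AWARE_ANTI_JOIN.key -> "true",
      // -1 disables automatic broadcasting entirely.
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    val df = sql("SELECT * FROM testData WHERE key NOT IN (SELECT a FROM testData2)")
    val plan = df.queryExecution.executedPlan
    // With broadcasting disabled, the planner must not pick BroadcastHashJoinExec
    // for the null-aware anti-join; it should fall back to another strategy.
    assert(plan.collect { case b: BroadcastHashJoinExec => b }.isEmpty)
  }
}
```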
