callmepandey opened a new pull request, #53670:
URL: https://github.com/apache/spark/pull/53670

   ### What changes were proposed in this pull request?
   
   This PR fixes an issue where null-aware anti-joins (enabled via
`spark.sql.optimizeNullAwareAntiJoin`) unconditionally used
`BroadcastHashJoinExec` without checking whether the right side was small enough
to broadcast according to `spark.sql.autoBroadcastJoinThreshold`.
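
   For illustration, this is the kind of query affected (a minimal sketch; the
table and column names are made up, not taken from the PR):

```scala
// Hypothetical repro: with the null-aware anti-join optimization on and
// broadcasting disabled, a NOT IN subquery could previously still be planned
// as a BroadcastHashJoinExec. Table/column names here are illustrative only.
spark.conf.set("spark.sql.optimizeNullAwareAntiJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val df = spark.sql(
  """SELECT * FROM orders
    |WHERE customer_id NOT IN (SELECT customer_id FROM blocked_customers)
    |""".stripMargin)
// Before this fix, the physical plan could still contain a broadcast hash join.
df.explain()
```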
   
   ### Why are the changes needed?
   
   When `spark.sql.optimizeNullAwareAntiJoin` is enabled, queries using `NOT
IN` with a subquery would always attempt to broadcast the right side, even when
it exceeded the broadcast threshold. With large datasets this could lead to
out-of-memory (OOM) errors.
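
   A minimal, self-contained sketch of the sizing check being introduced; this
is a simplified stand-in for the planner logic, not the actual Spark code:

```scala
// Only consider the broadcast-based null-aware anti-join when the build side
// fits under spark.sql.autoBroadcastJoinThreshold. Names are illustrative.
def canBroadcastNullAwareAntiJoin(
    buildSideSizeInBytes: BigInt,
    autoBroadcastJoinThreshold: Long): Boolean = {
  // A threshold of -1 (or any negative value) disables broadcasting entirely.
  autoBroadcastJoinThreshold >= 0 &&
    buildSideSizeInBytes <= autoBroadcastJoinThreshold
}

// With the default 10 MB threshold, a 100 MB build side is rejected...
assert(!canBroadcastNullAwareAntiJoin(BigInt(100L * 1024 * 1024), 10L * 1024 * 1024))
// ...while a 1 KB build side is still eligible for broadcast.
assert(canBroadcastNullAwareAntiJoin(BigInt(1024), 10L * 1024 * 1024))
```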
   
   ### Does this PR introduce any user-facing change?
   
   Yes. When `spark.sql.autoBroadcastJoinThreshold` is set to `-1` (broadcast
disabled) or to a value smaller than the right side's estimated size, null-aware
anti-joins now respect this configuration and fall back to alternative join
strategies instead of attempting to broadcast large tables.
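
   For example, either setting below now also governs null-aware anti-joins
(the values shown are illustrative):

```scala
// Disable automatic broadcasting entirely:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// ...or cap the broadcastable build-side size, e.g. at 1 MB:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (1024 * 1024).toString)
```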
   
   ### How was this patch tested?
   
   Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect
autoBroadcastJoinThreshold" in `JoinSuite` that verifies null-aware anti-joins
do not use `BroadcastHashJoinExec` when broadcasting is disabled.
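
   A sketch of the shape such a check could take in `JoinSuite`, assuming
Spark's `withSQLConf` test helper and illustrative table names (`t1`, `t2`);
the actual test body in this PR may differ:

```scala
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

// Sketch only: assumes temp views t1/t2 already exist in the test suite.
withSQLConf(
    "spark.sql.optimizeNullAwareAntiJoin" -> "true",
    "spark.sql.autoBroadcastJoinThreshold" -> "-1") {
  val df = sql("SELECT * FROM t1 WHERE a NOT IN (SELECT a FROM t2)")
  // Collect any broadcast hash joins from the executed physical plan.
  val broadcastJoins = df.queryExecution.executedPlan.collect {
    case b: BroadcastHashJoinExec => b
  }
  assert(broadcastJoins.isEmpty,
    "null-aware anti-join should not broadcast when broadcasting is disabled")
}
```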
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.

