callmepandey commented on PR #53670:
URL: https://github.com/apache/spark/pull/53670#issuecomment-3709246773

   Hi @cloud-fan,
   
   I've opened this PR to fix an issue where null-aware anti-joins (from 
SPARK-32290) do not respect the `spark.sql.autoBroadcastJoinThreshold` 
configuration, which can cause OOM errors when Spark attempts to broadcast a 
large table.
   
   **The Problem:**
   When `spark.sql.optimizeNullAwareAntiJoin` is enabled, single-column NOT IN 
subqueries are always planned as `BroadcastHashJoinExec`, regardless of the 
broadcast threshold, even when the right side is too large to broadcast safely.
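
   To make the problem concrete, here is a minimal repro sketch (the table 
names and sizes are hypothetical, chosen only to make the size mismatch 
obvious):

   ```scala
   // Hypothetical repro: broadcast is explicitly disabled, yet before this
   // fix the planner still chooses BroadcastHashJoinExec for the NOT IN.
   spark.conf.set("spark.sql.optimizeNullAwareAntiJoin", "true")
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

   spark.range(100L).toDF("id").createOrReplaceTempView("small")
   spark.range(100000000L).toDF("id").createOrReplaceTempView("huge")

   // A single-column NOT IN rewrites to a null-aware anti join (SPARK-32290);
   // the subquery side ("huge") becomes the broadcast candidate.
   spark.sql("SELECT * FROM small WHERE id NOT IN (SELECT id FROM huge)").explain()
   ```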
   
   **The Fix:**
   Added a `canBroadcastBySize(j.right, conf)` guard in 
SparkStrategies.scala:332 (see the sketch after this list) so that:
   - When the right side fits under the broadcast threshold: the planner keeps 
using `BroadcastHashJoinExec` (the optimized O(M) hash-lookup path) 
   - When it does not, or broadcast is disabled: the planner falls back to 
`BroadcastNestedLoopJoinExec` (slower O(M×N), but avoids the OOM)
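
   For context, the change is roughly of this shape (paraphrased around the 
existing `ExtractSingleColumnNullAwareAntiJoin` case in JoinSelection; please 
see the diff for the exact code):

   ```scala
   // Paraphrased sketch, not the literal diff: guard the null-aware
   // anti-join case on the right side actually being broadcastable.
   case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys)
       if canBroadcastBySize(j.right, conf) =>
     Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
       None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true))
   ```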
   
   **Testing:**
   Added a test case in JoinSuite.scala that verifies that null-aware anti-joins 
respect the threshold configuration when broadcast is disabled.
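
   The test is along these lines (illustrative shape only, using the 
`testData`/`testData2` fixtures and the `withSQLConf` helper already available 
in JoinSuite; the actual test is in the diff):

   ```scala
   // Illustrative sketch of the test, not the literal code from the PR.
   test("null-aware anti join respects spark.sql.autoBroadcastJoinThreshold") {
     withSQLConf(
         SQLConf.OPTIMIZE_NULL_AWARE_ANTI_JOIN.key -> "true",
         SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
         SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
       val plan = sql(
         "SELECT * FROM testData WHERE key NOT IN (SELECT a FROM testData2)")
         .queryExecution.executedPlan
       // With broadcast disabled, the hash-join path must not be chosen.
       assert(plan.collect { case j: BroadcastHashJoinExec => j }.isEmpty)
       assert(plan.collect { case j: BroadcastNestedLoopJoinExec => j }.nonEmpty)
     }
   }
   ```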
   
   The PR is rebased on the latest master and CI is passing. I would greatly 
appreciate your review, since you signed off on the original SPARK-32290 
implementation.
   
   Thank you!



