callmepandey commented on PR #53670: URL: https://github.com/apache/spark/pull/53670#issuecomment-3709246773
Hi @cloud-fan, I've opened this PR to fix an issue where null-aware anti-joins (from SPARK-32290) do not respect the `spark.sql.autoBroadcastJoinThreshold` configuration, which can cause OOM errors when attempting to broadcast large tables.

**The Problem:** When `spark.sql.optimizeNullAwareAntiJoin` is enabled, NOT IN subqueries always use `BroadcastHashJoinExec` regardless of the broadcast threshold setting, even when the right side is too large to broadcast safely.

**The Fix:** Added a `canBroadcastBySize(j.right, conf)` check in SparkStrategies.scala:332 so that:
- With broadcast enabled: uses `BroadcastHashJoinExec` (optimized O(M) hash lookup)
- With broadcast disabled: falls back to `BroadcastNestedLoopJoinExec` (slower O(M×N), but avoids OOM)

**Testing:** Added a test case in JoinSuite.scala that verifies null-aware anti-joins respect the threshold configuration when broadcast is disabled.

The PR is rebased on the latest master and CI is passing. I would greatly appreciate your review, since you signed off on the original SPARK-32290 implementation. Thank you!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
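
The planner change described above can be sketched roughly as follows. This is a simplified illustration, not the actual diff: the `ExtractSingleColumnNullAwareAntiJoin` pattern and `canBroadcastBySize` helper follow the names used in SparkStrategies.scala, but the surrounding structure is condensed and the exact arguments are assumed.

```scala
// Hypothetical sketch of the guarded strategy case (simplified, not the real diff).
// Assumes Spark-internal types: LogicalPlan, BuildRight, LeftAnti, planLater, etc.
case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys)
    if conf.optimizeNullAwareAntiJoin =>
  if (canBroadcastBySize(j.right, conf)) {
    // Right side fits under spark.sql.autoBroadcastJoinThreshold:
    // keep the optimized O(M) null-aware BroadcastHashJoinExec path.
    Seq(joins.BroadcastHashJoinExec(
      leftKeys, rightKeys, LeftAnti, BuildRight,
      condition = None,
      planLater(j.left), planLater(j.right),
      isNullAwareAntiJoin = true))
  } else {
    // Right side is too large to broadcast safely: return Nil so this case
    // no longer matches, and later strategies plan the NOT IN subquery as a
    // BroadcastNestedLoopJoinExec (slower O(M×N), but avoids driver OOM).
    Nil
  }
```

The key design point is that the size guard is added to the *match*, not to the execution path: when the right side exceeds the threshold, the strategy simply declines to plan the join, letting Spark's existing fallback strategies take over rather than hard-coding an alternative operator here.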
