sunchao opened a new pull request, #56719:
URL: https://github.com/apache/spark/pull/56719

   ### Why are the changes needed?
   
   A shuffled `LEFT OUTER` equi-join can retain a residual `ON` predicate that 
references only the preserved left input:
   
   ```sql
   SELECT *
   FROM left_table l
   LEFT OUTER JOIN right_table r
     ON l.k = r.k AND l.eligible
   ```
   
   Rows where `l.eligible` is false or null cannot match any right-side row, 
but the join must still emit them as unmatched rows. Spark currently shuffles 
only by `k`, so a common non-null key can funnel all such rows into one reducer 
even though they do not need to be co-located.
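
To make the skew concrete, here is a minimal sketch in plain Python (not Spark code). The reducer count, the row data, and the `partition_by_key` helper are hypothetical stand-ins for a hash shuffle on `k` alone; Spark's real partitioner uses a different hash function.

```python
from collections import Counter

NUM_REDUCERS = 8  # hypothetical shuffle partition count

# Hypothetical left-side rows as (k, eligible). Every row shares the hot
# key 7, and only the first 10 of the 1000 rows have eligible = True.
left_rows = [(7, i < 10) for i in range(1000)]


def partition_by_key(k, num_partitions=NUM_REDUCERS):
    # Stand-in for hash partitioning on the equi-join key only.
    return hash(k) % num_partitions


# Count how many left rows each reducer receives.
reducer_load = Counter(partition_by_key(k) for k, _ in left_rows)

# Rows that can never match a right-side row because the residual fails.
unmatchable = sum(1 for _, eligible in left_rows if not eligible)

print(dict(reducer_load), unmatchable)
```

In this toy setup a single reducer receives every row, although most of them can only ever be emitted as unmatched output.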
   
   This addresses 
[SPARK-57648](https://issues.apache.org/jira/browse/SPARK-57648).
   
   ### What changes were proposed in this PR?
   
   For shuffled left outer joins, when 
`spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled, this PR recognizes a 
conservative subset of deterministic residual predicates that reference only 
the left input.
   
   For eligible joins it appends a physical guard key:
   
   - left: `IF(residual_condition, TRUE, NULL)`
   - right: `TRUE`
   
   Rows for which the residual is true retain normal hash co-location. Rows for 
which it is false or null receive a null guard and use the existing 
`NullAwareHashPartitioning` path, allowing them to spread across reducers. The 
original residual remains on the join.
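
The guard-key idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PR's implementation: `guard_key` and `partition` are hypothetical helpers, `None` models SQL `NULL`, and round-robin placement is only an assumed stand-in for however `NullAwareHashPartitioning` spreads rows with a null key.

```python
from collections import Counter
from itertools import count

NUM_REDUCERS = 8  # hypothetical shuffle partition count
_round_robin = count()  # assumed spreading strategy for null-keyed rows


def guard_key(residual):
    # Left-side guard: IF(residual_condition, TRUE, NULL).
    return True if residual is True else None


def partition(k, guard, num_partitions=NUM_REDUCERS):
    # Assumption: a row with a null in any join key cannot match, so it
    # may go to any reducer; here it is placed round-robin.
    if k is None or guard is None:
        return next(_round_robin) % num_partitions
    # Otherwise hash all physical join keys, including the guard.
    return hash((k, guard)) % num_partitions


# Hypothetical left rows (k, eligible): one hot key, mostly ineligible.
left_rows = [(7, True)] * 10 + [(7, False)] * 900 + [(7, None)] * 90

matchable_parts = set()
spread_load = Counter()
for k, eligible in left_rows:
    g = guard_key(eligible)
    p = partition(k, g)
    if g is None:
        spread_load[p] += 1  # residual false or null: row is spread out
    else:
        matchable_parts.add(p)  # residual true: normal hash co-location

# The right side uses the constant guard TRUE, so it hashes alongside
# the matchable left rows.
right_part = partition(7, True)
print(matchable_parts == {right_part}, dict(spread_load))
```

Rows whose residual is true still meet their right-side partners on one reducer, while the 990 unmatchable rows are divided across all reducers instead of piling onto the hot key's partition.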
   
   A tag on the synthetic guard makes the shuffled join require every physical 
join key, preventing an existing partitioning on only the original equi-join 
keys from bypassing the guard. Ordinary null-aware joins keep their existing 
distribution behavior.
   
   ### How was this PR tested?
   
   - `./build/sbt "sql/testOnly org.apache.spark.sql.JoinSuite -- -z SPARK-57648"`: 3 tests passed.
   - `./build/sbt "sql/testOnly org.apache.spark.sql.connector.KeyGroupedPartitioningSuite -- -z SPARK-42038"`: 9 tests passed.
   - `./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.OuterJoinSuite -- -z ordinary"`: 8 tests passed.
   - `./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.ExistenceJoinSuite -- -z ordinary"`: 3 tests passed.
   - `./dev/lint-scala` — Scalastyle and Scalafmt passed.
   - `git diff --check`
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only on `master` and only when 
`spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled. Eligible shuffled 
left outer joins may then distribute provably unmatchable rows across multiple 
shuffle partitions. Query results are unchanged, and the configuration remains 
disabled by default.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex GPT-5
   

