Chao Sun created SPARK-57648:
--------------------------------

             Summary: Spread provably unmatchable left outer join rows across 
shuffle partitions
                 Key: SPARK-57648
                 URL: https://issues.apache.org/jira/browse/SPARK-57648
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Chao Sun
            Assignee: Chao Sun


A shuffled LEFT OUTER equi-join can have a residual ON predicate that 
references only the preserved left input, for example:

{code:sql}
SELECT *
FROM left_table l
LEFT OUTER JOIN right_table r
  ON l.k = r.k AND l.eligible
{code}

Rows where `l.eligible` is FALSE or NULL are provably unable to match any 
right-side row, but LEFT OUTER semantics still require Spark to emit them as 
unmatched rows. Spark currently shuffles only by the equi-join key. If many 
rejected rows share a common non-NULL key, they are all sent to one reducer 
even though they do not need to be co-located, causing avoidable skew.

When `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled, Spark can reuse 
the existing null-aware shuffle path for a conservative safe subset of these 
joins. For an eligible join, append a physical guard key:

* left: `IF(residual_condition, TRUE, NULL)`
* right: `TRUE`

Rows for which the residual is TRUE retain normal hash co-location on the 
original equi-join keys plus TRUE. Rows for which it is FALSE or NULL receive a 
NULL guard and can be spread by `NullAwareHashPartitioning`. The original 
residual predicate remains on the join as a defensive correctness check.

The optimization should be limited to shuffled LEFT OUTER joins where the 
residual is deterministic, references only the left input, and belongs to a 
conservative expression subset that is safe to evaluate before the shuffle. It 
must not change broadcast joins, predicates that reference the right side, 
nondeterministic predicates, or other join types.

Tests should cover sort-merge and shuffled-hash joins, result correctness for 
TRUE/FALSE/NULL residual values, physical null-aware partitioning, 
disabled/unsupported negative cases, and preservation of all synthetic physical 
join keys during distribution satisfaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to