Chao Sun created SPARK-57648:
--------------------------------
Summary: Spread provably unmatchable left outer join rows across
shuffle partitions
Key: SPARK-57648
URL: https://issues.apache.org/jira/browse/SPARK-57648
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Chao Sun
Assignee: Chao Sun
A shuffled LEFT OUTER equi-join can have a residual ON predicate that
references only the preserved left input, for example:
{code:sql}
SELECT *
FROM left_table l
LEFT OUTER JOIN right_table r
ON l.k = r.k AND l.eligible
{code}
Rows where `l.eligible` is FALSE or NULL are provably unable to match any
right-side row, but LEFT OUTER semantics still require Spark to emit them as
unmatched rows. Spark currently shuffles only by the equi-join key. If many
rejected rows share a common non-NULL key, they are all sent to one reducer
even though they do not need to be co-located, causing avoidable skew.
When `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled, Spark can reuse
the existing null-aware shuffle path for a conservative safe subset of these
joins. For an eligible join, append a physical guard key:
* left: `IF(residual_condition, TRUE, NULL)`
* right: `TRUE`
Rows for which the residual is TRUE retain normal hash co-location on the
original equi-join keys plus TRUE. Rows for which it is FALSE or NULL receive a
NULL guard and can be spread by `NullAwareHashPartitioning`. The original
residual predicate remains on the join as a defensive correctness check.
The optimization should be limited to shuffled LEFT OUTER joins where the
residual is deterministic, references only the left input, and belongs to a
conservative expression subset that is safe to evaluate before the shuffle. It
must not change broadcast joins, predicates that reference the right side,
nondeterministic predicates, or other join types.
Tests should cover sort-merge and shuffled-hash joins, result correctness for
TRUE/FALSE/NULL residual values, physical null-aware partitioning,
disabled/unsupported negative cases, and preservation of all synthetic physical
join keys during distribution satisfaction.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]