[
https://issues.apache.org/jira/browse/SPARK-57648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57648:
-----------------------------------
Labels: pull-request-available (was: )
> Spread provably unmatchable left outer join rows across shuffle partitions
> --------------------------------------------------------------------------
>
> Key: SPARK-57648
> URL: https://issues.apache.org/jira/browse/SPARK-57648
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> A shuffled LEFT OUTER equi-join can have a residual ON predicate that
> references only the preserved left input, for example:
> {code:sql}
> SELECT *
> FROM left_table l
> LEFT OUTER JOIN right_table r
> ON l.k = r.k AND l.eligible
> {code}
> Rows where `l.eligible` is FALSE or NULL are provably unable to match any
> right-side row, but LEFT OUTER semantics still require Spark to emit them as
> unmatched rows. Spark currently shuffles only by the equi-join key. If many
> rejected rows share a common non-NULL key, they are all sent to one reducer
> even though they do not need to be co-located, causing avoidable skew.
> When `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled, Spark can
> reuse the existing null-aware shuffle path for a conservative safe subset of
> these joins. For an eligible join, append a physical guard key:
> * left: `IF(residual_condition, TRUE, NULL)`
> * right: `TRUE`
> Rows for which the residual is TRUE retain normal hash co-location on the
> original equi-join keys plus TRUE. Rows for which it is FALSE or NULL receive
> a NULL guard and can be spread by `NullAwareHashPartitioning`. The original
> residual predicate remains on the join as a defensive correctness check.
> The optimization should be limited to shuffled LEFT OUTER joins where the
> residual is deterministic, references only the left input, and belongs to a
> conservative expression subset that is safe to evaluate before the shuffle.
> It must not change broadcast joins, predicates that reference the right side,
> nondeterministic predicates, or other join types.
> Tests should cover sort-merge and shuffled-hash joins, result correctness for
> TRUE/FALSE/NULL residual values, physical null-aware partitioning,
> disabled/unsupported negative cases, and preservation of all synthetic
> physical join keys during distribution satisfaction.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]