[ 
https://issues.apache.org/jira/browse/SPARK-57648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57648:
-----------------------------------
    Labels: pull-request-available  (was: )

> Spread provably unmatchable left outer join rows across shuffle partitions
> --------------------------------------------------------------------------
>
>                 Key: SPARK-57648
>                 URL: https://issues.apache.org/jira/browse/SPARK-57648
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>              Labels: pull-request-available
>
> A shuffled LEFT OUTER equi-join can have a residual ON predicate that 
> references only the preserved left input, for example:
> {code:sql}
> SELECT *
> FROM left_table l
> LEFT OUTER JOIN right_table r
>   ON l.k = r.k AND l.eligible
> {code}
> Rows where `l.eligible` is FALSE or NULL are provably unable to match any 
> right-side row, but LEFT OUTER semantics still require Spark to emit them as 
> unmatched rows. Spark currently shuffles only by the equi-join key. If many 
> rejected rows share a common non-NULL key, they are all sent to one reducer 
> even though they do not need to be co-located, causing avoidable skew.
> When `spark.sql.shuffle.spreadNullJoinKeys.enabled` is enabled, Spark can 
> reuse the existing null-aware shuffle path for a conservative safe subset of 
> these joins. For an eligible join, append a physical guard key:
> * left: `IF(residual_condition, TRUE, NULL)`
> * right: `TRUE`
> Rows for which the residual is TRUE retain normal hash co-location on the 
> original equi-join keys plus TRUE. Rows for which it is FALSE or NULL receive 
> a NULL guard and can be spread by `NullAwareHashPartitioning`. The original 
> residual predicate remains on the join as a defensive correctness check.
> The optimization should be limited to shuffled LEFT OUTER joins where the 
> residual is deterministic, references only the left input, and belongs to a 
> conservative expression subset that is safe to evaluate before the shuffle. 
> It must not change broadcast joins, predicates that reference the right side, 
> nondeterministic predicates, or other join types.
> Tests should cover sort-merge and shuffled-hash joins, result correctness for 
> TRUE/FALSE/NULL residual values, physical null-aware partitioning, 
> disabled/unsupported negative cases, and preservation of all synthetic 
> physical join keys during distribution satisfaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to