[
https://issues.apache.org/jira/browse/SPARK-27829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-27829.
---------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
Issue resolved by pull request 24693
[https://github.com/apache/spark/pull/24693]
> In Dataset.joinWith inner joins, don't nest data before shuffling
> -----------------------------------------------------------------
>
> Key: SPARK-27829
> URL: https://issues.apache.org/jira/browse/SPARK-27829
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Major
> Fix For: 3.0.0
>
>
> In order to support outer joins with null top-level objects, SPARK-15441
> modified Dataset.joinWith to project both inputs into single-column structs
> prior to the join.
> For inner joins, however, this step is unnecessary and actually harms
> performance: performing the nesting before the join increases the shuffled
> data size. As an optimization for inner joins only, we can move this nesting
> to occur after the join (effectively switching back to the pre- SPARK-15441
> behavior).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]