[ 
https://issues.apache.org/jira/browse/SPARK-27829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27829.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24693
[https://github.com/apache/spark/pull/24693]

> In Dataset.joinWith inner joins, don't nest data before shuffling
> -----------------------------------------------------------------
>
>                 Key: SPARK-27829
>                 URL: https://issues.apache.org/jira/browse/SPARK-27829
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In order to support outer joins with null top-level objects, SPARK-15441 
> modified Dataset.joinWith to project both inputs into single-column structs 
> prior to the join.
> For inner joins, however, this step is unnecessary and actually harms 
> performance: performing the nesting before the join increases the shuffled 
> data size. As an optimization for inner joins only, we can move this nesting 
> to occur after the join (effectively switching back to the pre- SPARK-15441 
> behavior). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to