[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-42038.
-----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed
Issue resolved by pull request 39633
[https://github.com/apache/spark/pull/39633]
> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
>
> Currently, storage-partitioned join (SPJ) requires both sides to be fully
> clustered on the partition values; that is, all input partitions reported by
> a V2 data source must be grouped by partition value before the join
> happens. This can lead to data skew if a particular partition value
> is associated with a large number of rows.
>
> To combat this, we can introduce the idea of partially clustered
> distribution, which means that only one side of the join is required to be
> fully clustered, while the other side may keep several input partitions per
> partition value. This allows Spark to increase the parallelism of the join
> and reduce data skew.
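
The idea described above can be illustrated with a small, self-contained
sketch in plain Python. It is not Spark's implementation: the function names
(group_by_value, plan_tasks), the split labels, and the choice to always leave
the left side unclustered are assumptions made for illustration only. It
models an inner join in which each input split is tagged with its partition
value, and compares how many join tasks each strategy produces.

```python
from collections import defaultdict


def group_by_value(splits):
    """Fully cluster a side: collect all splits that share a partition value."""
    groups = defaultdict(list)
    for value, split in splits:
        groups[value].append(split)
    return dict(groups)


def plan_tasks(left_splits, right_splits, partially_clustered=False):
    """Return a list of (partition_value, left_splits, right_splits) join tasks.

    Hypothetical model: when partially_clustered is True, the left side is
    left unclustered, so each left split becomes its own task and the fully
    clustered right group is repeated for each of those tasks.
    """
    left = group_by_value(left_splits)
    right = group_by_value(right_splits)
    tasks = []
    # Inner join: only partition values present on both sides produce tasks.
    for value in sorted(set(left) & set(right)):
        if partially_clustered:
            for split in left[value]:
                tasks.append((value, [split], right[value]))
        else:
            tasks.append((value, left[value], right[value]))
    return tasks


# Partition value "a" is skewed on the left side: it has three splits.
left = [("a", "l1"), ("a", "l2"), ("a", "l3"), ("b", "l4")]
right = [("a", "r1"), ("b", "r2")]

full = plan_tasks(left, right)
partial = plan_tasks(left, right, partially_clustered=True)
print(len(full), len(partial))  # prints "2 4"
```

In this toy model the skewed value "a" is handled by one task that reads three
left splits under full clustering, and by three tasks that each read one left
split under partial clustering, at the cost of reading the right-side group
for "a" three times.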