[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678025#comment-17678025 ]
Apache Spark commented on SPARK-42038:
--------------------------------------
User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/39633
> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Priority: Major
>
> Currently, the storage-partitioned join (SPJ) requires both sides to be fully
> clustered on the partition values; that is, all input partitions reported by
> a V2 data source must be grouped by partition value before the join
> happens. This can lead to data skew if a particular partition value
> is associated with a large number of rows.
>
> To combat this, we can introduce the idea of a partially clustered
> distribution, in which only one side of the join is required to be
> fully clustered, while the other side is not. This allows Spark to increase
> the parallelism of the join and mitigate the data skew.
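
The difference between the two distributions can be illustrated with a toy
Python model. This is only a sketch under stated assumptions, not Spark code
and not the implementation in the pull request above: all function and
variable names are hypothetical, input splits are modeled as
(partition value, rows) pairs, and the sketch assumes that each un-grouped
split on the skewed side is simply paired with the whole group for the same
partition value from the fully clustered side.

```python
from collections import defaultdict


def fully_cluster(splits):
    """Group all rows by partition value (hypothetical helper)."""
    groups = defaultdict(list)
    for value, rows in splits:
        groups[value].extend(rows)
    return dict(groups)


def plan_fully_clustered(left_splits, right_splits):
    """Both sides grouped: exactly one join task per partition value."""
    left = fully_cluster(left_splits)
    right = fully_cluster(right_splits)
    common = sorted(left.keys() & right.keys())
    return [(v, left[v], right[v]) for v in common]


def plan_partially_clustered(left_splits, right_splits):
    """Only the right side is grouped; each left split becomes its own task.

    Assumption of this sketch: the grouped right-side rows are reused for
    every left split that carries the same partition value.
    """
    right = fully_cluster(right_splits)
    return [(v, rows, right[v]) for v, rows in left_splits if v in right]


def run_task(task):
    """Inner join within one task; all rows share the same partition value."""
    value, left_rows, right_rows = task
    return [(value, l, r) for l in left_rows for r in right_rows]


def run_plan(plan):
    return sorted(row for task in plan for row in run_task(task))


# Partition value "a" is skewed on the left side: it arrives in three splits.
left_splits = [("a", [1, 2]), ("a", [3, 4]), ("a", [5, 6]), ("b", [7])]
right_splits = [("a", ["x"]), ("b", ["y"]), ("a", ["z"])]

full_plan = plan_fully_clustered(left_splits, right_splits)
partial_plan = plan_partially_clustered(left_splits, right_splits)

# Size of the largest task, measured in left-side rows.
full_max = max(len(left_rows) for _, left_rows, _ in full_plan)
partial_max = max(len(left_rows) for _, left_rows, _ in partial_plan)

print(len(full_plan), full_max)        # 2 tasks, largest holds 6 left rows
print(len(partial_plan), partial_max)  # 4 tasks, largest holds 2 left rows
```

In this model both plans produce the same join result, but the partially
clustered plan spreads the skewed value "a" over three tasks instead of one,
at the cost of reading the right-side rows for "a" once per task.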
--
This message was sent by Atlassian Jira
(v8.20.10#820010)