Chao Sun created SPARK-42038:
--------------------------------

             Summary: SPJ: Support partially clustered distribution
                 Key: SPARK-42038
                 URL: https://issues.apache.org/jira/browse/SPARK-42038
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.1
            Reporter: Chao Sun


Currently, storage-partitioned join (SPJ) requires both sides to be fully
clustered on the partition values: all input partitions reported by a
V2 data source must be grouped by partition value before the join happens.
This can lead to data skew if a particular partition value is
associated with a large number of rows.
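
As a rough illustration of the skew problem, the sketch below simulates the grouping step in plain Python. It is not Spark code: the split list, the row counts, and the helper name group_fully_clustered are hypothetical and chosen only to show how one hot partition value ends up in a single oversized task.

```python
from collections import defaultdict

# Hypothetical input splits reported by a V2 source: (partition_value, row_count).
splits = [
    ("2023-01-01", 100),
    ("2023-01-02", 120),
    ("2023-01-03", 5000),
    ("2023-01-03", 4800),
    ("2023-01-03", 5100),
]

def group_fully_clustered(splits):
    """Merge all splits that share a partition value into one join task."""
    rows_per_task = defaultdict(int)
    for value, rows in splits:
        rows_per_task[value] += rows
    return dict(rows_per_task)

tasks = group_fully_clustered(splits)
# The partition value whose single task has to process the most rows.
largest = max(tasks, key=tasks.get)
```

With these made-up numbers, the three splits for "2023-01-03" collapse into one task that processes far more rows than the other two tasks, so that task determines how long the join stage runs.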

To combat this, we can introduce the idea of partially clustered distribution,
in which only one side of the join is required to be fully clustered,
while the other side is not. This allows Spark to increase the parallelism of
the join and reduce data skew.
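
A minimal sketch of the idea, again as a plain-Python simulation rather than Spark's implementation. It assumes an inner join, and it assumes that the grouped partition from the fully clustered side is replicated to every split of the other side that shares its partition value, so that each task still sees all matching rows. The function name plan_join_tasks and the split identifiers are hypothetical.

```python
from collections import defaultdict

def plan_join_tasks(unclustered_side, clustered_side):
    """Pair the splits of two join sides into tasks.

    unclustered_side: list of (partition_value, split_id), kept as-is.
    clustered_side:   list of (partition_value, split_id), grouped by value.
    Returns a list of (partition_value, [one split], [grouped splits]).
    """
    # Fully cluster one side: all of its splits for a value form one group.
    grouped = defaultdict(list)
    for value, split in clustered_side:
        grouped[value].append(split)

    tasks = []
    for value, split in unclustered_side:
        # Each split on the unclustered side becomes its own task; the grouped
        # partition from the other side is replicated to each such task.
        tasks.append((value, [split], grouped.get(value, [])))
    return tasks

# Partition value "b" is skewed on the left side (three splits).
left = [("a", "L0"), ("b", "L1"), ("b", "L2"), ("b", "L3")]
right = [("a", "R0"), ("b", "R1"), ("b", "R2")]

tasks = plan_join_tasks(left, right)
```

Under full clustering the same input would produce two tasks, one per partition value. Here the skewed value "b" is spread over three tasks, at the cost of reading the right-hand group for "b" three times.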


