[ 
https://issues.apache.org/jira/browse/SPARK-47612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated SPARK-47612:
---------------------------
    Description: 
Currently, the side with partially clustered distribution is picked as follows:

SPJ (storage-partitioned join) relies on a simple heuristic and always picks 
the side with the smaller data size, based on table statistics, as the fully 
clustered side, even though that side could also contain skewed partitions. 


We could potentially do a more fine-grained comparison based on partition 
values, since that information is now available.
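
To illustrate the concern, here is a minimal Python sketch, not Spark code. It 
contrasts a total-size heuristic with one possible finer-grained rule. The 
function names, the per-partition size maps, and the "largest partition" rule 
are all assumptions made for illustration; the issue does not specify which 
comparison would actually be used.

```python
from typing import Dict


def pick_by_total_size(left: Dict[str, int], right: Dict[str, int]) -> str:
    """Mimics the current heuristic: fully cluster the side whose
    total size (as table statistics would report it) is smaller."""
    return "left" if sum(left.values()) <= sum(right.values()) else "right"


def pick_by_largest_partition(left: Dict[str, int], right: Dict[str, int]) -> str:
    """Hypothetical finer-grained rule (an assumption, not Spark's logic):
    fully cluster the side whose largest single partition is smaller,
    so that a skewed partition does not end up on the fully clustered side."""
    left_max = max(left.values(), default=0)
    right_max = max(right.values(), default=0)
    return "left" if left_max <= right_max else "right"


# Made-up sizes in bytes, keyed by partition value.
left_sizes = {"a": 900, "b": 10, "c": 10}      # smaller in total, skewed on "a"
right_sizes = {"a": 400, "b": 300, "c": 300}   # larger in total, evenly spread

by_total = pick_by_total_size(left_sizes, right_sizes)
by_partition = pick_by_largest_partition(left_sizes, right_sizes)
print(by_total, by_partition)
```

With these made-up numbers the total-size rule picks the left side even though 
its partition "a" is heavily skewed, while the per-partition rule picks the 
right side.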

  was:
Now we pick up the side of partially clustered distribution:


Using plan statistics to determine which side of join to fully
cluster partition values.

We can optimize to use partition size since we have the information now.


> Improve picking the side of partially clustered distribution according to 
> partition size
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-47612
>                 URL: https://issues.apache.org/jira/browse/SPARK-47612
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Qi Zhu
>            Priority: Major
>
> Currently, the side with partially clustered distribution is picked as follows:
> SPJ (storage-partitioned join) relies on a simple heuristic and always picks 
> the side with the smaller data size, based on table statistics, as the fully 
> clustered side, even though that side could also contain skewed partitions. 
> We could potentially do a more fine-grained comparison based on partition 
> values, since that information is now available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

