[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38237:
---------------------------------
    Description: 
We still find HashClusteredDistribution useful for batch queries as well. For 
example, we had a case with lower parallelism than expected because 
aggregation uses ClusteredDistribution, which is satisfied by a 
HashPartitioning on a subset of the grouping keys (note that the effective 
parallelism also depends on "cardinality": picking a subset of the keys means 
lower cardinality).

We propose to introduce a new config that requires all cluster keys on 
Aggregate, leveraging HashClusteredDistribution. To that end, we propose to 
rename the distribution back to HashClusteredDistribution while retaining the 
NOTE for stateful operators. The distribution still should not be changed, 
because stateful operators depend on it, but it can also be used for batch 
queries if needed.

  was:
We still find HashClusteredDistribution useful for batch queries as well. For 
example, we had a case with lower parallelism than expected because 
aggregation uses ClusteredDistribution, which is satisfied by a 
HashPartitioning on a subset of the grouping keys (note that the effective 
parallelism also depends on "cardinality": picking a subset of the keys means 
lower cardinality).

We propose to rename the distribution back to HashClusteredDistribution while 
retaining the NOTE for stateful operators. The distribution still should not 
be changed, because stateful operators depend on it, but it can also be used 
for batch queries if needed.


> Introduce a new config to require all cluster keys on Aggregate
> ---------------------------------------------------------------
>
>                 Key: SPARK-38237
>                 URL: https://issues.apache.org/jira/browse/SPARK-38237
>             Project: Spark
>          Issue Type: Task
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> aggregation uses ClusteredDistribution, which is satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a subset of the keys 
> means lower cardinality).
> We propose to introduce a new config that requires all cluster keys on 
> Aggregate, leveraging HashClusteredDistribution. To that end, we propose to 
> rename the distribution back to HashClusteredDistribution while retaining 
> the NOTE for stateful operators. The distribution still should not be 
> changed, because stateful operators depend on it, but it can also be used 
> for batch queries if needed.
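
The parallelism concern described above can be illustrated with a small, 
self-contained model. This is only a sketch in plain Python, not Spark code: 
toy_hash, partition_sizes, NUM_PARTITIONS and the sample data are all 
hypothetical names and values chosen for illustration, and the toy hash merely 
stands in for whatever hash function the engine actually uses. It compares how 
many partitions receive data when rows are hash-partitioned on one grouping 
key versus on all grouping keys.

```python
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count, for illustration only


def toy_hash(values):
    """Toy deterministic hash (31-multiplier combine); NOT Spark's hash."""
    h = 0
    for v in values:
        h = 31 * h + v
    return h


def partition_sizes(rows, key_indices):
    """Count rows per partition when hash-partitioning on the chosen columns."""
    sizes = Counter()
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        sizes[toy_hash(key) % NUM_PARTITIONS] += 1
    return sizes


# Grouping keys (a, b): a has 2 distinct values, b has 100.
rows = [(a, b) for a in range(2) for b in range(100)]

sub_key = partition_sizes(rows, [0])      # partitioned on a only
all_keys = partition_sizes(rows, [0, 1])  # partitioned on (a, b)

print("non-empty partitions, sub-key :", len(sub_key))
print("non-empty partitions, all keys:", len(all_keys))
```

With this toy data, partitioning on a alone places all rows in 2 of the 8 
partitions, while partitioning on (a, b) spreads them over all 8. A 
partitioning on a alone still keeps every (a, b) group within one partition, 
which is why it can be accepted for the aggregation even though the resulting 
parallelism is lower.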



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
