[ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim updated SPARK-38237:
---------------------------------
Description:
We still find HashClusteredDistribution useful for batch queries as well. For
example, we had a case with lower parallelism than expected because
ClusteredDistribution is used for aggregation, and it is satisfied by
HashPartitioning on a subset of the grouping keys (note that the effective
parallelism also depends on cardinality: partitioning on a subset of the keys
means lower cardinality).
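
To illustrate the cardinality point, here is a minimal sketch in plain Python
(not Spark code). The hash function, the partition count, and the (region,
user_id) data are all made up for illustration; Spark's real partitioner uses
a different hash. It only shows that partitioning on a low-cardinality subset
of the keys cannot fill more partitions than that subset has distinct values.

```python
from collections import Counter
from itertools import product

NUM_PARTITIONS = 8  # hypothetical shuffle partition count


def toy_partition(key, num_partitions=NUM_PARTITIONS):
    """Toy stand-in for hash partitioning; NOT Spark's actual hash function."""
    h = 0
    for part in key:
        h = h * 31 + part
    return h % num_partitions


# Hypothetical grouping keys (region, user_id): 2 regions x 100 users.
rows = list(product(range(2), range(100)))


def partition_sizes(rows, key_fn):
    """Count how many rows land in each partition for the chosen key."""
    return Counter(toy_partition(key_fn(r)) for r in rows)


# Partition on a subset of the grouping keys (region only).
sub_key = partition_sizes(rows, lambda r: (r[0],))
# Partition on all grouping keys (region, user_id).
full_key = partition_sizes(rows, lambda r: r)

print("partitions used with sub-key:", len(sub_key))    # 2
print("partitions used with full key:", len(full_key))  # 8
```

With only two distinct regions, the sub-key partitioning leaves six of the
eight partitions empty, while the full key spreads rows across all of them.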
We propose to introduce a new config that requires all cluster keys on
Aggregate, leveraging HashClusteredDistribution. To that end, we propose to
rename HashClusteredDistribution back, retaining the NOTE for stateful
operators. The distribution should still not be changed, because stateful
operators depend on it, but it can be shared with the batch case if needed.
was:
We still find HashClusteredDistribution useful for batch queries as well. For
example, we had a case with lower parallelism than expected because
ClusteredDistribution is used for aggregation, and it is satisfied by
HashPartitioning on a subset of the grouping keys (note that the effective
parallelism also depends on cardinality: partitioning on a subset of the keys
means lower cardinality).
We propose to rename HashClusteredDistribution back, retaining the NOTE for
stateful operators. The distribution should still not be changed, because
stateful operators depend on it, but it can be shared with the batch case if
needed.
> Introduce a new config to require all cluster keys on Aggregate
> ---------------------------------------------------------------
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
> Issue Type: Task
> Components: SQL, Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well.
> For example, we had a case with lower parallelism than expected because
> ClusteredDistribution is used for aggregation, and it is satisfied by
> HashPartitioning on a subset of the grouping keys (note that the effective
> parallelism also depends on cardinality: partitioning on a subset of the
> keys means lower cardinality).
> We propose to introduce a new config that requires all cluster keys on
> Aggregate, leveraging HashClusteredDistribution. To that end, we propose to
> rename HashClusteredDistribution back, retaining the NOTE for stateful
> operators. The distribution should still not be changed, because stateful
> operators depend on it, but it can be shared with the batch case if
> needed.