[ https://issues.apache.org/jira/browse/FLINK-35285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868252#comment-17868252 ]
Gyula Fora commented on FLINK-35285:
------------------------------------
Looping in [~mxm], as he may have input here too.
> Autoscaler key group optimization can interfere with scale-down.max-factor
> --------------------------------------------------------------------------
>
> Key: FLINK-35285
> URL: https://issues.apache.org/jira/browse/FLINK-35285
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Trystan
> Priority: Minor
>
> When a less aggressive scale-down limit is set, the key group optimization
> can prevent a vertex from scaling down at all. It searches from the target
> parallelism upwards to maxParallelism/2 for a divisor of maxParallelism. If
> currentParallelism is the first divisor at or above the target, it simply
> finds currentParallelism again.
>
> A simple test trying to scale down from a parallelism of 60 with a
> scale-down.max-factor of 0.2:
> {code:java}
> assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360));
> {code}
>
> It seems reasonable to make a good attempt to spread data across subtasks,
> but not at the expense of total deadlock. The problem is that during scale
> down the optimization doesn't actually ensure that newParallelism will be <
> currentParallelism. The only workaround is to set a scale-down factor large
> enough that the target reaches the next lower divisor of maxParallelism
> (45 in the example above, which needs a scale-down.max-factor of at least
> 0.25).
>
> The following change is clunky, but it would ensure that scaling can make at
> least some progress. Another test now fails with it, but it illustrates the
> point:
> {code:java}
> for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
>     if ((scaleFactor < 1 && p < currentParallelism)
>             || (scaleFactor > 1 && p > currentParallelism)) {
>         if (maxParallelism % p == 0) {
>             return p;
>         }
>     }
> }
> {code}
>
> Perhaps this is by design and not a bug, but total failure to scale down in
> order to keep optimized key groups does not seem ideal.
>
> Key group optimization block:
> [https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10]
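
For reference, the numbers in the quoted issue can be checked with a small standalone sketch. It is not the operator's code: the class and method names (KeyGroupAlignmentSketch, alignUnguarded, alignGuarded) are hypothetical, the target of 48 is taken from the issue's test (60 with a scale factor of 0.8), and falling back to the target when no divisor is found is an assumption. The guarded variant mirrors the loop proposed in the issue.

```java
public class KeyGroupAlignmentSketch {

    // Search upwards from the target for a divisor of maxParallelism,
    // as the issue describes the current behaviour.
    // Assumption: fall back to the target if no divisor is found.
    public static int alignUnguarded(int newParallelism, int maxParallelism, int upperBound) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        return newParallelism;
    }

    // Same search, but a divisor is only accepted if it moves in the
    // requested direction relative to currentParallelism (the guard
    // proposed in the issue).
    public static int alignGuarded(
            int currentParallelism,
            int newParallelism,
            int maxParallelism,
            int upperBound,
            double scaleFactor) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            boolean makesProgress =
                    (scaleFactor < 1 && p < currentParallelism)
                            || (scaleFactor > 1 && p > currentParallelism);
            if (makesProgress && maxParallelism % p == 0) {
                return p;
            }
        }
        return newParallelism;
    }

    public static void main(String[] args) {
        int currentParallelism = 60;
        int maxParallelism = 360;
        int upperBound = 360;
        double scaleFactor = 0.8;
        int target = 48; // 60 scaled by 0.8, the value the issue's test expects

        // No value in 48..59 divides 360, so the search lands on 60 again.
        System.out.println(alignUnguarded(target, maxParallelism, upperBound)); // prints 60

        // With the guard, 60 is rejected and the target is kept.
        System.out.println(
                alignGuarded(currentParallelism, target, maxParallelism, upperBound, scaleFactor)); // prints 48
    }
}
```

With the guard the result is 48, which is not a divisor of 360, so the vertex makes progress at the cost of an uneven key group distribution. That is the trade-off the issue raises.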
--
This message was sent by Atlassian Jira
(v8.20.10#820010)