[
https://issues.apache.org/jira/browse/FLINK-35594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854757#comment-17854757
]
Rui Fan commented on FLINK-35594:
---------------------------------
This Jira may be a duplicate of
https://issues.apache.org/jira/browse/FLINK-33977
> Downscaling doesn't release TaskManagers.
> -----------------------------------------
>
> Key: FLINK-35594
> URL: https://issues.apache.org/jira/browse/FLINK-35594
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.1
> Environment: * Flink 1.18.1 (Java 11, Temurin).
> * Kubernetes Operator 1.8
> * Kubernetes version v1.28.9-eks-036c24b (AWS EKS).
>
> Autoscaling configuration:
> {code:yaml}
> jobmanager.scheduler: adaptive
> job.autoscaler.enabled: "true"
> job.autoscaler.metrics.window: 15m
> job.autoscaler.stabilization.interval: 15m
> job.autoscaler.scaling.effectiveness.threshold: 0.2
> job.autoscaler.target.utilization: "0.75"
> job.autoscaler.target.utilization.boundary: "0.25"
> job.autoscaler.metrics.busy-time.aggregator: "AVG"
> job.autoscaler.restart.time-tracking.enabled: "true"{code}
> Reporter: Aviv Dozorets
> Priority: Major
> Attachments: Screenshot 2024-06-10 at 12.50.37 PM.png
>
>
> (Follow-up of a Slack conversation in the #troubleshooting channel.)
> Recently I've observed behavior that should be improved:
> A Flink DataStream job that runs with the autoscaler (backed by the Kubernetes
> operator) and the Adaptive Scheduler doesn't release nodes (TaskManagers) when
> scaling down. In my example the job started with an initial parallelism of 64
> on 4 TaskManagers with 16 cores each (1:1 core:slot) and was scaled down to a
> parallelism of 16.
> My expectation: 1 TaskManager should be up and running.
> Reality: all 4 initial TaskManagers are still running, with varying and unequal
> numbers of available slots.
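> A minimal sketch of the arithmetic behind that expectation, using only the
> numbers above (illustration only, not Flink code):
> {code:java}
> // Illustration: expected TaskManager count after the scale-down described above.
> public class ExpectedTaskManagers {
>     public static void main(String[] args) {
>         int slotsPerTaskManager = 16; // 16 cores per TM, 1:1 core:slot
>         int newParallelism = 16;      // parallelism after downscaling (was 64)
>         // ceil(16 / 16) = 1, so 3 of the 4 TaskManagers should become idle and releasable.
>         int requiredTaskManagers =
>                 (newParallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
>         System.out.println("TaskManagers required: " + requiredTaskManagers);
>     }
> }{code}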
>
> I didn't find an existing configuration option that changes this behavior.
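> For reference, a sketch of the general idle-resource release options that exist
> in Flink; it's not clear to me whether they apply to this adaptive-scheduler
> case (placeholder values in milliseconds, not recommendations):
> {code:yaml}
> # Placeholder values; these options govern releasing idle slots / idle TaskManagers in general.
> slot.idle.timeout: 50000
> resourcemanager.taskmanager-timeout: 30000{code}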
--
This message was sent by Atlassian Jira
(v8.20.10#820010)