Aviv Dozorets created FLINK-35594:
-------------------------------------
Summary: Downscaling doesn't release TaskManagers.
Key: FLINK-35594
URL: https://issues.apache.org/jira/browse/FLINK-35594
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.18.1
Environment: * Flink 1.18.1 (Java 11, Temurin).
* Kubernetes Operator 1.8
* Kubernetes version v1.28.9-eks-036c24b (AWS EKS).
Autoscaling configuration:
{code:yaml}
jobmanager.scheduler: adaptive
job.autoscaler.enabled: "true"
job.autoscaler.metrics.window: 15m
job.autoscaler.stabilization.interval: 15m
job.autoscaler.scaling.effectiveness.threshold: 0.2
job.autoscaler.target.utilization: "0.75"
job.autoscaler.target.utilization.boundary: "0.25"
job.autoscaler.metrics.busy-time.aggregator: "AVG"
job.autoscaler.restart.time-tracking.enabled: "true"
{code}
Reporter: Aviv Dozorets
Attachments: Screenshot 2024-06-10 at 12.50.37 PM.png
(Follow-up to a Slack conversation in the #troubleshooting channel.)
Recently I've observed behavior that should be improved:
A Flink DataStream job running with the autoscaler (backed by the Kubernetes
Operator) and the Adaptive Scheduler doesn't release TaskManagers when scaling
down.
In my example the job started with an initial parallelism of 64 across 4
TaskManagers with 16 slots each (1:1 core:slot) and was later scaled down to a
parallelism of 16.
My expectation: a single TaskManager should remain up and running (16 slots
needed / 16 slots per TaskManager = 1 TM).
Reality: all 4 initial TaskManagers keep running, each with a different number
of available slots.
I didn't find an existing configuration option to change this behavior.
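For anyone triaging this, a minimal sketch of the idle-resource options I'd
expect to govern TaskManager release; both keys exist in Flink, but whether
they take effect under the Adaptive Scheduler is an assumption on my part, and
the values below are illustrative:
{code:yaml}
# Assumption: applies with the adaptive scheduler on native Kubernetes.
# Release a TaskManager once it has been idle this long (default: 30 s).
resourcemanager.taskmanager-timeout: 60s
# Timeout in ms before an idle slot is returned to the ResourceManager
# (default: 50000).
slot.idle.timeout: 50000
{code}
Neither option changed anything obvious in my reading of the docs for this
scenario, so treat this as a guess at related knobs rather than a known
workaround.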