Hi all,

I’ve been looking into how the autoscaler behaves with jobs that have a large number of tasks and wanted to share some thoughts to start a discussion.

The problem

Right now, the autoscaler implicitly assumes that each task gets a full second of processing time. While this works in simple cases where there is only one task, it breaks down in other situations. If cgroups are not enabled on YARN, TaskManagers can use much more than their allocated 1 vCore, leading to overallocation. When cgroups are enabled, the autoscaler may try to maintain a target utilization that is not realistically achievable due to CPU constraints. Backlog-based scaling might keep the job running without lag, but the underlying behaviour is hard to reason about.

Another challenge is that the target utilization metric does not reflect actual CPU usage. This makes it hard for users to understand how efficient their job really is, or to tune the utilization threshold effectively across a large number of jobs.

I believe this is a fundamental limitation of the original research paper <https://www.usenix.org/system/files/osdi18-kalavri.pdf> that the Flink autoscaler is based on. In a way, the research paper implicitly assumes that slot-sharing <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/finegrained_resource/> is disabled in Flink.

Potential Solution

One idea is to make the autoscaler aware of how tasks are placed and shift the focus from task-level to TaskManager-level utilization. For example, if sub-tasks a1, b2, and c5 are colocated on the same TaskManager, their individual loads should be summed to represent the total pressure on that TaskManager. The autoscaler could then scale based on the average or peak pressure across all TaskManagers rather than treating tasks in isolation.

A possible approach would be:
- Use existing metrics like busy time and backpressure to estimate per-task load.
- Group colocated tasks and aggregate their loads by TaskManager.
- Calculate TaskManager-level utilization and use that as the signal for scaling.
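To make the aggregation step concrete, here is a rough sketch of what I have in mind. All class and method names here are hypothetical for illustration; this is not the actual autoscaler code, and the real implementation would pull busy-time/backpressure metrics from the metric system rather than take them as plain inputs:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: aggregate per-subtask load by TaskManager and derive
// average/peak utilization as a scaling signal. Not actual Flink autoscaler APIs.
public class TmUtilizationSketch {

    // Illustrative load record: a subtask, where it is placed, and its estimated
    // busy ratio (e.g. derived from busyTimeMsPerSecond / 1000).
    record TaskLoad(String subtaskId, String taskManagerId, double busyRatio) {}

    // Sum the busy ratios of all subtasks colocated on the same TaskManager,
    // so slot sharing is reflected as combined pressure on that TaskManager.
    static Map<String, Double> aggregateByTaskManager(Iterable<TaskLoad> loads) {
        Map<String, Double> perTm = new HashMap<>();
        for (TaskLoad load : loads) {
            perTm.merge(load.taskManagerId(), load.busyRatio(), Double::sum);
        }
        return perTm;
    }

    // Average pressure across all TaskManagers.
    static double averageUtilization(Map<String, Double> perTm) {
        return perTm.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Peak pressure: the most loaded TaskManager.
    static double peakUtilization(Map<String, Double> perTm) {
        return perTm.values().stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // a1, b2, c5 share tm-1 (the colocation example from above); a2 runs alone on tm-2.
        List<TaskLoad> loads = List.of(
                new TaskLoad("a1", "tm-1", 0.40),
                new TaskLoad("b2", "tm-1", 0.30),
                new TaskLoad("c5", "tm-1", 0.20),
                new TaskLoad("a2", "tm-2", 0.50));
        Map<String, Double> perTm = aggregateByTaskManager(loads);
        System.out.println("per-TM: " + perTm);
        System.out.println("avg: " + averageUtilization(perTm));
        System.out.println("peak: " + peakUtilization(perTm));
    }
}
```

In this toy example tm-1 would report roughly 0.9 combined pressure even though no single subtask on it looks busy in isolation, which is exactly the signal a per-task view misses.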
This approach will not capture everything, such as background threads or garbage collection, but it should be a step closer to aligning with real CPU usage patterns.

Next steps

I have not worked out all the implementation details yet, and I admit this approach adds significantly more complexity than I'd like. But given how Flink places and runs tasks with slot sharing, I think the extra complexity might be necessary. I would love to get feedback from others before going further with a design doc. Let me know what you think.

Thanks,
Sharath