Hi all,

I’ve been looking into how the autoscaler behaves with jobs that have a
large number of tasks and wanted to share some thoughts to start a
discussion.

The problem

Right now, the autoscaler implicitly assumes that each task gets a full
second of processing time per second, i.e. a dedicated core. While this
holds in the simple case where a TaskManager runs only a single task, it
breaks down once multiple tasks share the same TaskManager.
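
To make that assumption concrete, here is a rough sketch of how a
busy-time-based parallelism estimate extrapolates each task to a full
second. This is not the actual autoscaler code; the names are mine, and
only busyTimeMsPerSecond is a real per-subtask Flink metric.

    // Rough illustration only, not the autoscaler implementation.
    // busyTimeMsPerSecond is the standard per-subtask Flink metric (0..1000).
    static int requiredParallelism(
            double busyTimeMsPerSecond,
            double observedRecordsPerSecond,   // per subtask
            double targetRecordsPerSecond,     // for the whole vertex
            double targetUtilization) {
        double busyRatio = busyTimeMsPerSecond / 1000.0;
        // Extrapolating to a full second assumes the subtask could claim the
        // idle (1 - busyRatio) share of a core entirely for itself.
        double trueProcessingRate = observedRecordsPerSecond / busyRatio;
        return (int) Math.ceil(
                targetRecordsPerSecond / (trueProcessingRate * targetUtilization));
    }

That extrapolation only holds if nothing else is competing for the same
core.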

If cgroups are not enabled on YARN, TaskManagers can use much more than
their allocated 1 vCore, leading to overallocation. When cgroups are
enabled, the autoscaler may try to maintain a target utilization that is
not realistically achievable due to CPU constraints. Backlog-based scaling
might keep the job running without lag, but the underlying behaviour is
hard to reason about.

Another challenge is that the target utilization metric does not reflect
actual CPU usage. This makes it hard for users to understand how efficient
their job really is or to tune the utilization threshold effectively across
a large number of jobs. I believe this is a fundamental limitation of the
original research paper
<https://www.usenix.org/system/files/osdi18-kalavri.pdf> that the Flink
autoscaler is based on. In a way, the research paper implicitly assumes
that slot-sharing
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/finegrained_resource/>
is disabled in Flink.

Potential solution

One idea is to make the autoscaler aware of how tasks are placed and shift
the focus from task-level to TaskManager-level utilization.

For example, if sub-tasks a1, b2, and c5 are colocated on the same
TaskManager, their individual loads should be summed to represent the total
pressure on that TaskManager. The autoscaler could then scale based on the
average or peak pressure across all TaskManagers rather than treating tasks
in isolation.

A possible approach would be:

   - Use existing metrics like busy time and backpressure to estimate
     per-task load.
   - Group colocated tasks and aggregate their loads by TaskManager.
   - Calculate TaskManager-level utilization and use that as the signal for
     scaling (see the sketch below).
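
To make the last two steps concrete, here is a minimal sketch of the
aggregation, assuming we can map each subtask's busy ratio
(busyTimeMsPerSecond / 1000) to the TaskManager it is deployed on. The
class and method names are illustrative only, not a concrete API proposal.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only; names and grouping are assumptions.
    final class TaskManagerUtilization {

        /** Total load of one TaskManager = sum of its colocated subtasks' busy ratios. */
        static double taskManagerLoad(List<Double> subtaskBusyRatios) {
            return subtaskBusyRatios.stream().mapToDouble(Double::doubleValue).sum();
        }

        /** Average and peak load across TaskManagers, as a possible scaling signal. */
        static double[] averageAndPeakLoad(Map<String, List<Double>> subtasksByTaskManager) {
            double total = 0.0;
            double peak = 0.0;
            for (List<Double> ratios : subtasksByTaskManager.values()) {
                double load = taskManagerLoad(ratios);
                total += load;
                peak = Math.max(peak, load);
            }
            double average =
                    subtasksByTaskManager.isEmpty() ? 0.0 : total / subtasksByTaskManager.size();
            return new double[] {average, peak};
        }
    }

With a single vCore per TaskManager, a load above 1.0 would mean the
colocated subtasks together demand more than the core they share, which is
exactly the situation the current per-task view cannot see.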

This approach will not capture everything (for example, background threads
or garbage collection), but it should be a step closer to reflecting real
CPU usage patterns.

Next steps

I have not worked out all the implementation details yet, and I admit this
approach adds significantly more complexity than I'd like. But given how
Flink places and runs tasks with slot sharing, I think the extra complexity
might be necessary.

I would love to get feedback from others before going further with a design
doc. Let me know what you think.

Thanks,
Sharath
