Maximilian Michels created FLINK-30773:
------------------------------------------
Summary: Allow rescaling of jobs based on per-vertex parallelism overrides
Key: FLINK-30773
URL: https://issues.apache.org/jira/browse/FLINK-30773
Project: Flink
Issue Type: New Feature
Components: Autoscaler, Runtime / Coordination, Runtime / REST
Reporter: Maximilian Michels
Assignee: Maximilian Michels
FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism
overrides map. This feature is already used today by the Autoscaler of the
Flink Kubernetes operator. However, it requires a full restart of the Flink job
and only supports the application deployment mode.
In a K8s environment, this is inefficient because all pods for a deployment
will be surrendered. Upon restart, they have to be re-acquired. In addition to
being slow, this can also lead to situations where resource constraints prevent
a restart from executing properly.
Ideally, we would want the following to happen on receiving a rescale
request:
# Rescale API receives a request with a parallelism overrides map (vertexId =>
parallelism) for a jobId
# Compute the number of required task slots using the overrides and the
current JobGraph
## If the cluster has fewer task slots in total than the rescale requires,
acquire the missing task slots. Otherwise, do nothing
## Wait for new task slots to become available
## Abort rescale request on timeout
# Redeploy the JobGraph / Tasks with the updated parallelisms
# Surrender any unused task slots in case of scaling down
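The slot accounting in the steps above could be sketched roughly as follows. This is only an illustration, not Flink code: the class and method names are hypothetical, vertex IDs are plain strings, and it assumes all vertices share a single slot sharing group, so that the required slot count equals the highest vertex parallelism. With several slot sharing groups, the per-group maxima would have to be summed instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: slot accounting for a rescale request.
class RescaleSlotSketch {

    /** Applies the overrides on top of the current per-vertex parallelisms. */
    static Map<String, Integer> applyOverrides(
            Map<String, Integer> current, Map<String, Integer> overrides) {
        Map<String, Integer> result = new HashMap<>(current);
        for (Map.Entry<String, Integer> e : overrides.entrySet()) {
            if (!current.containsKey(e.getKey())) {
                throw new IllegalArgumentException("Unknown vertex: " + e.getKey());
            }
            if (e.getValue() < 1) {
                throw new IllegalArgumentException("Parallelism must be >= 1");
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    /**
     * Assumption: one shared slot sharing group, so the job needs as many
     * slots as its most parallel vertex.
     */
    static int requiredSlots(Map<String, Integer> parallelisms) {
        int max = 0;
        for (int p : parallelisms.values()) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Slots to acquire before redeploying (0 if the cluster already has enough). */
    static int missingSlots(int totalClusterSlots, int required) {
        return Math.max(0, required - totalClusterSlots);
    }

    /** Slots that could be surrendered after scaling down (0 if none are unused). */
    static int surplusSlots(int totalClusterSlots, int required) {
        return Math.max(0, totalClusterSlots - required);
    }
}
```

For example, overriding one vertex of a job from parallelism 4 to 8 on a cluster with 5 slots would yield 3 missing slots under these assumptions. Waiting for the slots and aborting on timeout are left out of the sketch.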
There is an existing "Rescaling" API which is currently disabled. We have to
evaluate whether reusing it makes sense.