Maximilian Michels created FLINK-30773:
------------------------------------------
             Summary: Allow rescaling of jobs based on per-vertex parallelism overrides
                 Key: FLINK-30773
                 URL: https://issues.apache.org/jira/browse/FLINK-30773
             Project: Flink
          Issue Type: New Feature
          Components: Autoscaler, Runtime / Coordination, Runtime / REST
            Reporter: Maximilian Michels
            Assignee: Maximilian Michels

FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism overrides map. This feature is already used today by the Autoscaler of the Flink Kubernetes operator. However, it requires a full restart of the Flink job and only supports the application deployment mode.

In a K8s environment, this is inefficient because all pods of a deployment are surrendered on restart and then have to be re-acquired. In addition to being slow, this can also lead to situations where resource constraints prevent the restart from executing properly.

Ideally, we would want the following to happen on receiving a rescale request:

# Rescale API receives a request with a parallelism overrides map (vertexId => parallelism) for a jobId
# Compute the number of required task slots using the overrides and the current JobGraph
## If the total number of task slots in the cluster is less than the number of task slots required by the rescale, acquire the missing task slots. Otherwise, do nothing
## Wait for new task slots to become available
## Abort rescale request on timeout
# Redeploy the JobGraph / Tasks with the updated parallelisms
# Surrender any unused task slots in case of scaling down

There is an existing "Rescaling" API which is currently disabled. We have to evaluate whether reusing it makes sense.
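To make the slot computation in step 2 concrete, here is a minimal sketch and not a proposed implementation. It uses plain Java maps instead of Flink classes, and every name in it (RescaleSketch, applyOverrides, requiredSlots, missingSlots) is hypothetical. It assumes all vertices share a single slot sharing group (the Flink default), in which case the job needs as many slots as its highest vertex parallelism. Jobs with several slot sharing groups would need a per-group computation. Ignoring overrides for unknown vertex IDs is also just an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class RescaleSketch {

    /**
     * Returns the current per-vertex parallelisms with the overrides applied.
     * Overrides for vertex IDs that are not part of the job are ignored
     * (an assumption of this sketch, not specified behavior).
     */
    static Map<String, Integer> applyOverrides(
            Map<String, Integer> current, Map<String, Integer> overrides) {
        Map<String, Integer> result = new HashMap<>(current);
        for (Map.Entry<String, Integer> e : overrides.entrySet()) {
            if (result.containsKey(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    /**
     * Required task slots, assuming a single slot sharing group:
     * the maximum parallelism over all vertices.
     */
    static int requiredSlots(Map<String, Integer> parallelisms) {
        int max = 0;
        for (int p : parallelisms.values()) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Slots that still have to be acquired; zero if the cluster already has enough. */
    static int missingSlots(int required, int totalInCluster) {
        return Math.max(0, required - totalInCluster);
    }

    public static void main(String[] args) {
        Map<String, Integer> current = Map.of("source", 2, "map", 2, "sink", 1);
        Map<String, Integer> overrides = Map.of("map", 6);

        int required = requiredSlots(applyOverrides(current, overrides));
        // With 4 slots in the cluster, 2 more have to be acquired.
        System.out.println("required=" + required + " missing=" + missingSlots(required, 4));
    }
}
```

Running main prints "required=6 missing=2". When scaling down, the same numbers give the surplus to surrender in the last step: the total slots minus the required slots, floored at zero.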