Maximilian Michels created FLINK-30773:
------------------------------------------

             Summary: Allow rescaling of jobs based on per-vertex parallelism 
overrides
                 Key: FLINK-30773
                 URL: https://issues.apache.org/jira/browse/FLINK-30773
             Project: Flink
          Issue Type: New Feature
          Components: Autoscaler, Runtime / Coordination, Runtime / REST
            Reporter: Maximilian Michels
            Assignee: Maximilian Michels


FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism 
overrides map. This feature is already used today by the Autoscaler of the 
Flink Kubernetes operator. However, it requires a full restart of the Flink job 
and only supports the application deployment mode.

In a K8s environment, this is inefficient because all pods for a deployment 
will be surrendered. Upon restart, they have to be re-acquired. In addition to 
being slow, this can also lead to situations where resource constraints prevent 
a restart from executing properly.

Ideally, we would want the following to happen on receiving a rescale 
request:
 # Rescale API receives a request with a parallelism overrides map (vertexId => 
parallelism) for a jobId
 # Compute the number of required task slots using the overrides and the 
current JobGraph
 ## If the total number of task slots for the cluster is less than the required 
number of task slots of the rescale, acquire the missing task slots. Otherwise, 
do nothing
 ## Wait for new task slots to become available
 ## Abort rescale request on timeout
 # Redeploy the JobGraph / Tasks with the updated parallelisms
 # Surrender any unused task slots in case of scaling down
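
As a rough illustration of step 2, the slot computation could look like the sketch below. It is not taken from the Flink code base: the class RescaleSketch, its method names, and the plain-map representation of the JobGraph are hypothetical. It also assumes that vertices in the same slot sharing group share slots, so each group needs as many slots as its highest vertex parallelism.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Flink. */
class RescaleSketch {

    /** Applies the overrides map; vertices without an override keep their current parallelism. */
    static Map<String, Integer> effectiveParallelism(
            Map<String, Integer> currentParallelism, Map<String, Integer> overrides) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : currentParallelism.entrySet()) {
            result.put(e.getKey(), overrides.getOrDefault(e.getKey(), e.getValue()));
        }
        return result;
    }

    /**
     * Assumption: one slot can host one subtask of every vertex in a slot sharing group,
     * so a group needs max(parallelism) slots and the job needs the sum over all groups.
     */
    static int requiredSlots(
            Map<String, Integer> parallelism, Map<String, String> slotSharingGroupOfVertex) {
        Map<String, Integer> maxPerGroup = new HashMap<>();
        for (Map.Entry<String, Integer> e : parallelism.entrySet()) {
            String group = slotSharingGroupOfVertex.getOrDefault(e.getKey(), "default");
            maxPerGroup.merge(group, e.getValue(), Math::max);
        }
        int total = 0;
        for (int slots : maxPerGroup.values()) {
            total += slots;
        }
        return total;
    }

    /** Step 2.1: number of slots to acquire; zero if the cluster already has enough. */
    static int missingSlots(int requiredSlots, int totalClusterSlots) {
        return Math.max(0, requiredSlots - totalClusterSlots);
    }
}
```

For example, with vertices source (2), map (4) and sink (1), where sink sits in its own slot sharing group, an override of map => 8 would require 8 + 1 slots under these assumptions. Waiting for the slots and aborting on timeout (steps 2.2 and 2.3) are left out of the sketch.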

There is an existing "Rescaling" API which is currently disabled. We have to 
evaluate whether reusing it makes sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)