[
https://issues.apache.org/jira/browse/FLINK-30773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686646#comment-17686646
]
Maximilian Michels commented on FLINK-30773:
--------------------------------------------
Thanks for the pointers! Indeed, the rescaling logic can be much improved. I
think we will just go with an upfront reservation of the resources and then
perform a full job restart once the new resources are acquired, but I agree
that it would be great to optimize the rescaling further by intelligently
migrating the application state during a rescale operation. Note that we
already have an autoscaling implementation in the Flink Kubernetes operator
which was added in FLINK-30260.
> Add API for rescaling of jobs based on per-vertex parallelism overrides
> -----------------------------------------------------------------------
>
> Key: FLINK-30773
> URL: https://issues.apache.org/jira/browse/FLINK-30773
> Project: Flink
> Issue Type: New Feature
> Components: Autoscaler, Runtime / Coordination, Runtime / REST
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
> Priority: Major
> Attachments: meces.patch
>
>
> FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism
> overrides map. This feature is already used today by the Autoscaler of the
> Flink Kubernetes operator. However, it requires a full restart of the Flink
> job and only supports the application deployment mode.
> In a K8s environment, this is inefficient because all pods for a deployment
> will be surrendered. Upon restart, they have to be re-acquired. In addition
> to being slow, this can also lead to situations where resource constraints
> prevent a restart from executing properly.
> Ideally, we would want the following to happen on receiving a rescale
> request:
> # Rescale API receives a request with a parallelism overrides map (vertexId
> => parallelism) for a jobId
> # Compute the number of required task slots using the overrides and the
> current JobGraph
> ## If the total number of task slots in the cluster is less than the
> number of task slots required for the rescale, acquire the missing task
> slots. Otherwise, do nothing
> ## Wait for new task slots to become available
> ## Abort rescale request on timeout
> # Redeploy the JobGraph / Tasks with the updated parallelisms
> # Surrender any unused task slots in case of scaling down
>
> There is an existing "Rescaling" API which is currently disabled. We have to
> evaluate whether reusing it makes sense.
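
The slot arithmetic in the numbered rescale steps of the issue description can be illustrated with a small, self-contained Java sketch. It is not Flink code: the class RescalePlanSketch and all of its methods are hypothetical names introduced here for illustration only. It also assumes that every vertex shares a single slot sharing group, so that the required number of task slots equals the highest vertex parallelism; a job with several slot sharing groups would need a different calculation. Waiting for new slots and aborting on timeout are not modelled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the slot calculation for a rescale request. Not a Flink API. */
public class RescalePlanSketch {

    /** Returns the per-vertex parallelism after applying the overrides map (vertexId => parallelism). */
    static Map<String, Integer> applyOverrides(
            Map<String, Integer> currentParallelism, Map<String, Integer> overrides) {
        Map<String, Integer> result = new HashMap<>(currentParallelism);
        for (Map.Entry<String, Integer> override : overrides.entrySet()) {
            if (!currentParallelism.containsKey(override.getKey())) {
                throw new IllegalArgumentException("Unknown vertex: " + override.getKey());
            }
            if (override.getValue() < 1) {
                throw new IllegalArgumentException("Parallelism must be >= 1");
            }
            result.put(override.getKey(), override.getValue());
        }
        return result;
    }

    /**
     * Assumption: all vertices are in one slot sharing group, so the job needs
     * as many slots as its most parallel vertex.
     */
    static int requiredSlots(Map<String, Integer> parallelism) {
        int max = 0;
        for (int p : parallelism.values()) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Slots that have to be acquired before redeploying (0 if the cluster is large enough). */
    static int missingSlots(int requiredSlots, int totalClusterSlots) {
        return Math.max(0, requiredSlots - totalClusterSlots);
    }

    /** Slots that can be surrendered after scaling down (0 if none are unused). */
    static int surplusSlots(int requiredSlots, int totalClusterSlots) {
        return Math.max(0, totalClusterSlots - requiredSlots);
    }

    public static void main(String[] args) {
        // Made-up vertex ids and parallelisms, for illustration only.
        Map<String, Integer> current = new HashMap<>();
        current.put("source", 2);
        current.put("map", 4);
        current.put("sink", 2);

        Map<String, Integer> overrides = new HashMap<>();
        overrides.put("map", 8);

        int required = requiredSlots(applyOverrides(current, overrides));
        int totalClusterSlots = 4;
        System.out.println("required=" + required
                + " missing=" + missingSlots(required, totalClusterSlots)
                + " surplus=" + surplusSlots(required, totalClusterSlots));
    }
}
```

Keeping the calculation separate from slot acquisition means a request can be validated, and rejected early, before any resources are reserved.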
--
This message was sent by Atlassian Jira
(v8.20.10#820010)