[
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852748#comment-17852748
]
Matthias Pohl edited comment on FLINK-35035 at 6/6/24 11:47 AM:
----------------------------------------------------------------
Thanks for the pointer, [~dmvk]. We looked into this issue while working on
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
(which is kind of related) and plan to do a follow-up FLIP that will align the
resource controlling mechanism of the {{AdaptiveScheduler}}'s
{{WaitingForResources}} and {{Executing}} states.
Currently, we have parameters intervening in the rescaling in different places
([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min],
[j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max]
being utilized in {{Executing}} and
[j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout]
being utilized in {{WaitingForResources}}). Having a
{{resource-stabilization}} phase in {{Executing}} should resolve the problem
described in this Jira issue here.
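For reference, the three options mentioned above live under different scheduler states today. A minimal, illustrative flink-conf.yaml fragment (the keys are the documented options linked above; the values are example durations, not recommendations):

```yaml
# Illustrative values only; tune to your environment.
jobmanager.scheduler: adaptive

# Consulted while the job is in the Executing state:
jobmanager.adaptive-scheduler.scaling-interval.min: 30s
jobmanager.adaptive-scheduler.scaling-interval.max: 5min

# Consulted while the job is in the WaitingForResources state:
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 10s
```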
> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
> Key: FLINK-35035
> URL: https://issues.apache.org/jira/browse/FLINK-35035
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.19.0
> Reporter: yuanfenghu
> Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by
> cluster expansion cause long-term task stagnation. We should reduce this
> impact.
> As an example:
> I have a job graph of: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job is executed as [v1 p5] -> [v2 p5].
> When I add slots, the job triggers job graph changes via
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots are not discovered at the same time (for
> convenience, assume each TaskManager has one slot): no matter what
> environment we add them in, we cannot guarantee that the new slots all
> arrive at once, so onNewResourcesAvailable is triggered repeatedly.
> If each new slot arrives after a certain interval, the job graph keeps
> changing during this period. What I hope for is a configurable
> stabilization time: only after the number of cluster slots has been stable
> for that period should the job graph change be triggered, avoiding this
> situation.
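The stabilization behavior requested above is essentially a debounce: reset a timer whenever the slot count changes, and only rescale once the count has held steady for the whole window. A standalone sketch of that idea (this is not Flink code; the class and method names are hypothetical, with onResourcesChanged standing in for ResourceListener#onNewResourcesAvailable):

```java
import java.time.Duration;

// Illustrative sketch only: debounce resource notifications so a rescale
// fires only after the slot count has been stable for a configured window.
class ResourceStabilizer {
    private final long stabilizationMillis;
    private long lastChangeAt = Long.MIN_VALUE; // no change observed yet
    private int lastSlotCount = -1;

    ResourceStabilizer(Duration stabilizationWindow) {
        this.stabilizationMillis = stabilizationWindow.toMillis();
    }

    // Called whenever new resources are reported (cf. onNewResourcesAvailable).
    // A changed slot count resets the stabilization timer.
    void onResourcesChanged(int slotCount, long nowMillis) {
        if (slotCount != lastSlotCount) {
            lastSlotCount = slotCount;
            lastChangeAt = nowMillis;
        }
    }

    // True only once the slot count has not changed for the whole window.
    boolean shouldRescale(long nowMillis) {
        return lastChangeAt != Long.MIN_VALUE
                && nowMillis - lastChangeAt >= stabilizationMillis;
    }
}
```

With a 10s window, slots trickling in one by one keep resetting the timer, so only one job graph change happens after the cluster settles, instead of one per discovered slot.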
--
This message was sent by Atlassian Jira
(v8.20.10#820010)