[jira] [Updated] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

yuanfenghu (Jira) Sun, 07 Apr 2024 00:34:07 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


yuanfenghu updated FLINK-35035:
-------------------------------
    Description: 
When 'jobmanager.scheduler = adaptive' is set, job graph changes triggered by 
cluster expansion can stall tasks for a long time. We should reduce this impact.
As an example:
I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
When my cluster has 5 slots, the job runs as [v1 p5] -> [v2 p5].
When I add slots, the job graph change is triggered by
org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
However, the five new slots I add are not discovered at the same time (for 
convenience, I assume that each TaskManager has one slot). No matter what 
environment we run in, we cannot guarantee that the new slots all arrive at 
once, so onNewResourcesAvailable is triggered repeatedly. If there is some 
interval between slot arrivals, the job graph keeps changing during that 
period. What I would like is a stabilization time for cluster resources: the 
job graph change should be triggered only after the number of cluster slots 
has been stable for a certain period of time, to avoid this situation.
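
To make the effect concrete, here is a minimal, self-contained sketch of the 
stabilization idea described above. It is not Flink code: the class name, the 
method names, and the 5-second stabilization window are hypothetical and 
chosen only for illustration. It compares how many rescales happen when every 
slot arrival triggers one immediately with how many happen when a rescale 
fires only after no new slot has arrived for the stabilization window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Flink.
class StabilizationSketch {

    /** Without stabilization, every slot arrival triggers its own rescale. */
    static int rescalesWithoutStabilization(long[] arrivalTimesMs) {
        return arrivalTimesMs.length;
    }

    /**
     * With stabilization, each arrival (re)starts a timer of stabilizationMs.
     * A rescale fires only when the timer expires before the next arrival.
     * Assumes arrivalTimesMs is sorted in ascending order.
     */
    static List<Long> rescaleTimesWithStabilization(long[] arrivalTimesMs, long stabilizationMs) {
        List<Long> fireTimes = new ArrayList<>();
        for (int i = 0; i < arrivalTimesMs.length; i++) {
            long deadline = arrivalTimesMs[i] + stabilizationMs;
            boolean isLast = (i == arrivalTimesMs.length - 1);
            // An arrival at or before the deadline resets the timer instead of firing.
            if (isLast || arrivalTimesMs[i + 1] > deadline) {
                fireTimes.add(deadline);
            }
        }
        return fireTimes;
    }

    public static void main(String[] args) {
        // Five new single-slot TaskManagers arriving 2 seconds apart.
        long[] arrivals = {0, 2_000, 4_000, 6_000, 8_000};
        System.out.println(rescalesWithoutStabilization(arrivals));         // prints 5
        System.out.println(rescaleTimesWithStabilization(arrivals, 5_000)); // prints [13000]
    }
}
```

In this sketch the job is rescaled once, directly from p5 to p10, instead of 
five times. The trade-off is that the rescale starts later (here, 5 seconds 
after the last slot arrives), so the window would need to be configurable.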



> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35035
>                 URL: https://issues.apache.org/jira/browse/FLINK-35035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.19.0
>            Reporter: yuanfenghu
>            Priority: Minor
>
> When 'jobmanager.scheduler = adaptive' is set, job graph changes triggered by 
> cluster expansion can stall tasks for a long time. We should reduce this 
> impact.
> As an example:
> I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job runs as [v1 p5] -> [v2 p5].
> When I add slots, the job graph change is triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I add are not discovered at the same time (for 
> convenience, I assume that each TaskManager has one slot). No matter what 
> environment we run in, we cannot guarantee that the new slots all arrive at 
> once, so onNewResourcesAvailable is triggered repeatedly. If there is some 
> interval between slot arrivals, the job graph keeps changing during that 
> period. What I would like is a stabilization time for cluster resources: the 
> job graph change should be triggered only after the number of cluster slots 
> has been stable for a certain period of time, to avoid this situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
