[jira] [Comment Edited] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

yuanfenghu (Jira) Tue, 09 Apr 2024 19:20:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835564#comment-17835564
 ]


yuanfenghu edited comment on FLINK-35035 at 4/10/24 2:19 AM:
-------------------------------------------------------------

[~echauchot] 
Thank you for your reply, but I have some questions:

jobmanager.adaptive-scheduler.min-parallelism-increase is a parameter on 
jobmanager, so I cannot update this value after the cluster is started. 
Assuming it is set to 5, this time it causes some problems:

The original task is [v1 (maxp=10 minp = 1)] -> [v2 (maxp=10, minp=1)]. If I 
call restapi, the parallelism is overwritten to the new [v1 (maxp=12 minp = 1)] 
-> [v2 (maxp=12, minp=1)], then I added slots to the cluster, but obviously I 
only need to add 2 slots to meet the requirements, but because 
min-parallelism-increase was not reached, So this will not cause the task to 
trigger expansion. It needs to wait until scaling-interval.max is reached 
before triggering (scaling-interval.max needs to be set first). I think in this 
case, should I add a configuration item to support its triggering?

 

Maybe can add a switch similar to 
jobmanager.adaptive-scheduler.min-parallelism-increase. When the resource 
changes, it will be judged whether the current resource fully meets the 
parallelism requirements of the job. If it is satisfied, rescheduling will be 
triggered directly. If it is not satisfied, it will be rescheduled in after 
scaling-interval.max . WDYT? [~echauchot] 

Looking forward to your reply!

 


was (Author: JIRAUSER296932):
[~echauchot] 
Thank you for your reply, but I have some questions:

jobmanager.adaptive-scheduler.min-parallelism-increase is a parameter on 
jobmanager, so I cannot update this value after the cluster is started. 
Assuming it is set to 5, this time it causes some problems:

The original task is [v1 (maxp=10 minp = 1)] -> [v2 (maxp=10, minp=1)]. If I 
call restapi, the parallelism is overwritten to the new [v1 (maxp=12 minp = 1)] 
-> [v2 (maxp=12, minp=1)], then I added slots to the cluster, but obviously I 
only need to add 2 slots to meet the requirements, but because 
min-parallelism-increase was not reached, So this will not cause the task to 
trigger expansion. It needs to wait until scaling-interval.max is reached 
before triggering (scaling-interval.max needs to be set first). I think in this 
case, should I add a configuration item to support its triggering?
 

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35035
>                 URL: https://issues.apache.org/jira/browse/FLINK-35035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.19.0
>            Reporter: yuanfenghu
>            Priority: Minor
>
> When 'jobmanager.scheduler = adaptive' , job graph changes triggered by 
> cluster expansion will cause long-term task stagnation. We should reduce this 
> impact.
> As an example:
> I have jobgraph for : [v1 (maxp=10 minp = 1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5]
> When I add slots the task will trigger jobgraph changes，by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable，
> However, the five new slots I added were not discovered at the same time (for 
> convenience, I assume that a taskmanager has one slot), because no matter 
> what environment we add, we cannot guarantee that the new slots will be added 
> at once, so this will cause onNewResourcesAvailable triggers repeatedly
> ，If each new slot action has a certain interval, then the jobgraph will 
> continue to change during this period. What I hope is that there will be a 
> stable time to configure the cluster resources  and then go to it after the 
> number of cluster slots has been stable for a certain period of time. Trigger 
> jobgraph changes to avoid this situation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

Reply via email to