[jira] [Commented] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

yuanfenghu (Jira) Tue, 09 Apr 2024 19:08:03 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835564#comment-17835564
 ]


yuanfenghu commented on FLINK-35035:
------------------------------------

[~echauchot] 
Thank you for your reply, but I have some questions:

jobmanager.adaptive-scheduler.min-parallelism-increase is a JobManager 
configuration option, so I cannot change its value after the job has started. 
Assuming it is set to 5, this causes a problem in the following case:

The original job is [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]. I call 
the REST API to override the parallelism to [v1 (maxp=12, minp=1)] -> 
[v2 (maxp=12, minp=1)] and then add slots to the cluster. I only need to add 
2 slots to meet the new requirement, but because that increase is below 
min-parallelism-increase, the job does not rescale. It has to wait until 
scaling-interval.max is reached (and scaling-interval.max has to be set in 
the first place). Should we add a configuration option that allows the 
rescale to be triggered in this case?
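
To make the case concrete, here is a minimal sketch of the trigger rule as I 
understand it from this discussion. It is a simplified model, not Flink code: 
the function name and its parameters are hypothetical, and it assumes the job 
is running at parallelism 10 before the 2 slots are added.

```python
from typing import Optional


def should_rescale(current_parallelism: int,
                   achievable_parallelism: int,
                   min_parallelism_increase: int,
                   seconds_since_last_rescale: float,
                   scaling_interval_max: Optional[float]) -> bool:
    """Simplified model of the rule described in this comment (hypothetical)."""
    increase = achievable_parallelism - current_parallelism
    if increase <= 0:
        # No additional parallelism is available, so there is nothing to do.
        return False
    if increase >= min_parallelism_increase:
        # The gain is large enough to justify a rescale.
        return True
    # Below the threshold: only scaling-interval.max, if it is set,
    # eventually forces the rescale.
    return (scaling_interval_max is not None
            and seconds_since_last_rescale >= scaling_interval_max)


# maxp raised from 10 to 12, 2 slots added, min-parallelism-increase = 5.
print(should_rescale(10, 12, 5, 60.0, None))    # scaling-interval.max unset
print(should_rescale(10, 12, 5, 300.0, 300.0))  # scaling-interval.max reached
```

In this model the first call never rescales, however long the job waits, 
which is the situation I am asking about.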
 

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35035
>                 URL: https://issues.apache.org/jira/browse/FLINK-35035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.19.0
>            Reporter: yuanfenghu
>            Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by 
> cluster expansion cause the job to stall for a long time. We should reduce 
> this impact.
> As an example:
> I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job is executed as [v1 p5] -> [v2 p5].
> When I add slots, the job graph change is triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I add are not discovered at the same time (for 
> convenience, I assume that each TaskManager has one slot). In any 
> environment, we cannot guarantee that all new slots arrive at once, so 
> onNewResourcesAvailable is triggered repeatedly. If there is some interval 
> between the arrival of each new slot, the job graph keeps changing during 
> this period. What I would like is a stabilization period for the cluster 
> resources: the job graph change is triggered only after the number of 
> cluster slots has been stable for a certain time, which avoids this 
> situation.
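
The stabilization period requested in the description can be illustrated 
with a small debounce sketch. This is only an illustration of the idea under 
stated assumptions, not how Flink implements it: the class, its methods, and 
the explicit `now` timestamps (in seconds) are all hypothetical.

```python
class SlotStabilizationWindow:
    """Hypothetical debounce: fire once the slot count has stopped changing."""

    def __init__(self, stable_seconds: float, initial_slots: int):
        self.stable_seconds = stable_seconds
        self.slot_count = initial_slots
        self.last_change = None   # time of the most recent slot count change
        self.pending = False      # True while a change awaits a rescale

    def on_slot_count(self, slot_count: int, now: float) -> None:
        # Every change to the slot count restarts the stabilization timer.
        if slot_count != self.slot_count:
            self.slot_count = slot_count
            self.last_change = now
            self.pending = True

    def should_trigger(self, now: float) -> bool:
        # Trigger at most once per burst of changes, and only after the
        # slot count has been stable for the configured period.
        if not self.pending:
            return False
        if now - self.last_change < self.stable_seconds:
            return False
        self.pending = False
        return True


# Five slots arrive 3 seconds apart; the window is 10 seconds.
window = SlotStabilizationWindow(stable_seconds=10.0, initial_slots=5)
for i, t in enumerate([0.0, 3.0, 6.0, 9.0, 12.0]):
    window.on_slot_count(6 + i, t)
    print(t, window.should_trigger(t))
print(22.0, window.should_trigger(22.0))
```

With this rule the five arrivals lead to a single job graph change after the 
last slot has settled, instead of one change per slot.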



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
