[ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836504#comment-17836504
 ] 

yuanfenghu commented on FLINK-35035:
------------------------------------

[~echauchot] 
Thank you for your patience in tracking this issue!
 
> The only thing is that you will have more frequent rescales (each time a slot 
> is added to the cluster) modulo slots that are added during the stabilization 
> period that do not lead to a rescale.

 
This is the problem. Imagine that I originally want to raise the parallelism 
from 10 to 12. My procedure is to first set the job's maximum parallelism to 
12 through the REST API, and then add TaskManagers to the cluster. If 
min-parallelism-increase = 1, my job may trigger the rescaling process twice 
while the slot count grows from 10 to 12, and this process may last for 
minutes. If min-parallelism-increase > 2, for example 5, my job has to wait 
until scaling-interval.max expires before it rescales at all. I think we can 
optimize this process: let the job trigger the rescale exactly when the slot 
count reaches 12.
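For context, these are the adaptive-scheduler options involved in the scenario 
above (the values are illustrative, not defaults):

```yaml
# flink-conf.yaml
jobmanager.scheduler: adaptive
# rescale as soon as the achievable parallelism grows by at least this much
jobmanager.adaptive-scheduler.min-parallelism-increase: 1
# lower bound between two rescales
jobmanager.adaptive-scheduler.scaling-interval.min: 30s
# force a rescale after this long even if min-parallelism-increase is not met
jobmanager.adaptive-scheduler.scaling-interval.max: 1min
```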
 
 

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35035
>                 URL: https://issues.apache.org/jira/browse/FLINK-35035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.19.0
>            Reporter: yuanfenghu
>            Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by 
> cluster expansion cause long periods of task stagnation. We should reduce 
> this impact.
> As an example:
> I have a jobgraph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job is executed as [v1 p5] -> [v2 p5].
> When I add 5 slots, the job graph change is triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I added are not discovered at the same time (for 
> convenience, assume each taskmanager has one slot): no matter what 
> environment we add them in, we cannot guarantee that the new slots all 
> arrive at once, so onNewResourcesAvailable triggers repeatedly.
> If each new slot arrives after a certain interval, the jobgraph keeps 
> changing during this period. What I hope for is a configurable stabilization 
> time for cluster resources: trigger the jobgraph change only after the 
> number of cluster slots has been stable for a certain period, to avoid this 
> situation.
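The "stable for a certain period" idea from the quoted description could be 
sketched as a simple debounce on resource notifications. This is a hypothetical 
illustration of the proposal, not Flink's actual implementation; the class and 
method names are made up:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: debounce onNewResourcesAvailable so that a rescale
// is only triggered once the slot count has stayed unchanged for a
// configurable stabilization window.
public class StabilizationWindow {
    private final Duration stabilization;
    private Instant lastChange = Instant.MIN;
    private int lastSlotCount = -1;

    public StabilizationWindow(Duration stabilization) {
        this.stabilization = stabilization;
    }

    // Called on every new-resources notification; returns true only when
    // the slot count has been stable for the whole window.
    public boolean shouldRescale(int currentSlots, Instant now) {
        if (currentSlots != lastSlotCount) {
            lastSlotCount = currentSlots;
            lastChange = now;
            return false; // still changing: restart the window
        }
        return Duration.between(lastChange, now).compareTo(stabilization) >= 0;
    }

    public static void main(String[] args) {
        StabilizationWindow w = new StabilizationWindow(Duration.ofSeconds(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(w.shouldRescale(11, t0));                // false: count changed
        System.out.println(w.shouldRescale(12, t0.plusSeconds(5))); // false: changed again
        System.out.println(w.shouldRescale(12, t0.plusSeconds(20))); // true: stable for 15s
    }
}
```

With such a window, adding TaskManagers one by one would reset the timer on 
each arrival and produce a single rescale at the end, instead of one rescale 
per discovered slot.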



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
