[ 
https://issues.apache.org/jira/browse/FLINK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanfenghu updated FLINK-35926:
-------------------------------
    Summary: During rescale, AdaptiveScheduler has incorrect judgment logic for 
the max parallelism.  (was: During rescale, jobmanager has incorrect judgment 
logic for the max parallelism.)

> During rescale, AdaptiveScheduler has incorrect judgment logic for the max 
> parallelism.
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-35926
>                 URL: https://issues.apache.org/jira/browse/FLINK-35926
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.19.1
>         Environment: flink-1.19.1
> There is a high probability that 1.18 has the same problem
>            Reporter: yuanfenghu
>            Priority: Blocker
>         Attachments: image-2024-07-30-14-56-48-931.png, 
> image-2024-07-30-14-59-26-976.png, image-2024-07-30-15-00-28-491.png
>
>
> When I was using the adaptive scheduler and changed the job's parallelism 
> through the REST API, an incorrect decision was made, causing the job 
> to fail.
> h2. Steps to reproduce:
> When I start a simple job with a parallelism of 128, the Max Parallelism of 
> the job is set to 256 (through Flink's internal calculation logic). Then 
> I take a savepoint of the job, change the parallelism of the job to 1, and 
> restore the job from the savepoint. At this time, the Max Parallelism of the 
> job is still 256:
>  
> !image-2024-07-30-14-56-48-931.png!
>  
> This is as expected. At this point I call the REST API to increase the 
> parallelism to 129 (which is obviously reasonable, since it is < 256), but 
> the job throws an exception after restarting:
>  
> !image-2024-07-30-14-59-26-976.png!
> At this time, when viewing the detailed information of the task, it is found 
> that Max Parallelism has changed to 128:
>  
> !image-2024-07-30-15-00-28-491.png!
>  
> This can be reproduced reliably in a local setup.
>  
> h3. Causes:
>  
> In AdaptiveScheduler we recalculate the job's `VertexParallelismStore`, 
> which results in the job having the wrong max parallelism after restart. 
> This seems to be related to FLINK-21844 and FLINK-22084.
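
The 256 and 128 values reported above are consistent with Flink's default max-parallelism rule. The sketch below is a standalone re-implementation of that rule as I understand it from `KeyGroupRangeAssignment.computeDefaultMaxParallelism`; the class name and helper are hypothetical and written for illustration, and this is not the scheduler code itself. It suggests, as an assumption rather than a confirmed diagnosis, that the default is being recomputed from the restored parallelism of 1 instead of being taken from the savepoint.

```java
public class DefaultMaxParallelismSketch {
    // Bounds assumed to match Flink's defaults: 2^7 and 2^15.
    static final int LOWER_BOUND = 1 << 7;   // 128
    static final int UPPER_BOUND = 1 << 15;  // 32768

    // Smallest power of two that is >= x (for x >= 1).
    static int roundUpToPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    // Round 1.5 * parallelism up to a power of two, then clamp to the bounds.
    static int computeDefaultMaxParallelism(int parallelism) {
        int candidate = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(candidate, LOWER_BOUND), UPPER_BOUND);
    }

    public static void main(String[] args) {
        // Original job: parallelism 128 gives max parallelism 256.
        System.out.println(computeDefaultMaxParallelism(128));
        // Job restored with parallelism 1: a recomputed default would be 128.
        System.out.println(computeDefaultMaxParallelism(1));
    }
}
```

Under this rule a recomputed max parallelism of 128 makes a request for parallelism 129 fail, even though the savepoint was written with max parallelism 256.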



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
