[jira] [Updated] (FLINK-35926) During rescale, AdaptiveScheduler has incorrect judgment logic for the max parallelism.

yuanfenghu (Jira) Tue, 30 Jul 2024 01:22:03 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


yuanfenghu updated FLINK-35926:
-------------------------------
    Description: 
When I used the adaptive scheduler and changed the job's parallelism through
the REST API, the scheduler judged the max parallelism incorrectly, causing the
job to fail.
h2. Reproduce:

When I start a simple job with a parallelism of 128, the max parallelism of the
job is set to 256 (by Flink's internal calculation logic). I then take a
savepoint of the job, change the job's parallelism to 1, and restore the job
from the savepoint. At this point, the max parallelism of the job is still 256:
 
!image-2024-07-30-14-56-48-931.png!

 
This is as expected. I then call the REST API to increase the parallelism to
129 (which should be valid, since it is < 256), but the job throws an exception
after restarting:
 
!image-2024-07-30-14-59-26-976.png!
At this point, the job's detailed information shows that the max parallelism
has changed to 128:
 
!image-2024-07-30-15-00-28-491.png!

 
This can be reproduced reliably in a local environment.
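
The rescale call in the steps above can be sketched as follows. This is a minimal sketch under assumptions, not the exact request from this report: it assumes the adaptive scheduler's `PUT /jobs/:jobid/resource-requirements` endpoint with a per-vertex `parallelism` object holding `lowerBound` and `upperBound`, which should be checked against the REST API docs of the Flink version in use. The base URL, job id, vertex id and class name are hypothetical placeholders, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RescaleRequestSketch {

    // Builds the JSON body for one job vertex (assumed body shape).
    static String buildBody(String vertexId, int lowerBound, int upperBound) {
        return "{\"" + vertexId + "\":{\"parallelism\":{\"lowerBound\":" + lowerBound
                + ",\"upperBound\":" + upperBound + "}}}";
    }

    // Builds (but does not send) the PUT request for the given job.
    static HttpRequest buildRequest(String baseUrl, String jobId, String body) {
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/jobs/" + jobId + "/resource-requirements"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical ids; replace with the real job id and job vertex id.
        String jobId = "00000000000000000000000000000000";
        String vertexId = "11111111111111111111111111111111";

        // Ask for an upper bound of 129, as in the reproduction steps.
        String body = buildBody(vertexId, 1, 129);
        HttpRequest request = buildRequest("http://localhost:8081", jobId, body);

        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```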
 
h3. Causes:

 
In AdaptiveScheduler we recalculate the job's `VertexParallelismStore`, which
results in the job having the wrong max parallelism after the restart. This
seems to be related to FLINK-21844 and FLINK-22084.
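
The two values seen above (256 and 128) are consistent with Flink's default max-parallelism formula, sketched below. This is a re-implementation for illustration only, assuming the default is computed as in `KeyGroupRangeAssignment.computeDefaultMaxParallelism` (round `p + p/2` up to a power of two, then clamp to the range 128 to 32768); the class and method names in the sketch are hypothetical. It only shows that a recalculation based on parallelism 1 would yield 128; whether that is what AdaptiveScheduler does here still needs to be confirmed.

```java
class DefaultMaxParallelismSketch {

    // Assumed bounds of the default max parallelism.
    static final int LOWER_BOUND = 1 << 7;  // 128
    static final int UPPER_BOUND = 1 << 15; // 32768

    // Smallest power of two that is >= x, for x >= 1.
    static int roundUpToPowerOfTwo(int x) {
        return x <= 1 ? 1 : Integer.highestOneBit(x - 1) << 1;
    }

    // Default max parallelism derived from an operator's parallelism.
    static int computeDefaultMaxParallelism(int parallelism) {
        int rounded = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(rounded, LOWER_BOUND), UPPER_BOUND);
    }

    public static void main(String[] args) {
        System.out.println(computeDefaultMaxParallelism(128)); // 256
        System.out.println(computeDefaultMaxParallelism(1));   // 128
        System.out.println(computeDefaultMaxParallelism(129)); // 256
    }
}
```

Under these assumptions, a request for parallelism 129 is valid against a max parallelism of 256 but invalid against 128.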

  was:
When I was using the adaptive scheduler and modified the task in parallel 
through the rest api, an incorrect decision logic occurred, causing the task to 
fail.
h2. produce:
When I start a simple job with a parallelism of 128, the Max Parallelism of the 
job will be set to 256 (through flink's internal calculation logic). Then I 
make a savepoint on the job and modify the parallelism of the job to 1. Restore 
the job from the savepoint. At this time, the Max Parallelism of the job is 
still 256:
 
!image-2024-07-30-14-56-48-931.png!

 
this is as expected, at this time I call the rest api to increase the 
parallelism to 129 (which is obviously reasonable, since it is < 128), but the 
task throws an exception after restarting:
 
!image-2024-07-30-14-59-26-976.png!
At this time, when viewing the detailed information of the task, it is found 
that Max Parallelism has changed to 128:
 
!image-2024-07-30-15-00-28-491.png!

 
This can be reproduced stably locally
 
h3. Causes:
 
In AdaptiveScheduler we recalculate the job `VertexParallelismStore`,
This results in the job after restart having the wrong max parallelism.

, which seems to be related to FLINK-21844 and FLINK-22084 .


> During rescale, AdaptiveScheduler has incorrect judgment logic for the max 
> parallelism.
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-35926
>                 URL: https://issues.apache.org/jira/browse/FLINK-35926
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.19.1
>         Environment: flink-1.19.1
> There is a high probability that 1.18 has the same problem
>            Reporter: yuanfenghu
>            Priority: Blocker
>         Attachments: image-2024-07-30-14-56-48-931.png, 
> image-2024-07-30-14-59-26-976.png, image-2024-07-30-15-00-28-491.png



