[
https://issues.apache.org/jira/browse/FLINK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yuanfenghu updated FLINK-35926:
-------------------------------
Summary: During rescale, AdaptiveScheduler has incorrect judgment logic for
the max parallelism. (was: During rescale, jobmanager has incorrect judgment
logic for the max parallelism.)
> During rescale, AdaptiveScheduler has incorrect judgment logic for the max
> parallelism.
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-35926
> URL: https://issues.apache.org/jira/browse/FLINK-35926
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.19.1
> Environment: flink-1.19.1
> There is a high probability that 1.18 has the same problem
> Reporter: yuanfenghu
> Priority: Blocker
> Attachments: image-2024-07-30-14-56-48-931.png,
> image-2024-07-30-14-59-26-976.png, image-2024-07-30-15-00-28-491.png
>
>
> When I was using the adaptive scheduler and changed the job's parallelism
> through the REST API, incorrect decision logic was applied, causing the job
> to fail.
> h2. Reproduce:
> When I start a simple job with a parallelism of 128, the Max Parallelism of
> the job is set to 256 (through Flink's internal calculation logic). Then
> I take a savepoint of the job, change the parallelism of the job to 1, and
> restore the job from the savepoint. At this point, the Max Parallelism of the
> job is still 256:
>
> !image-2024-07-30-14-56-48-931.png!
>
> This is as expected. I then call the REST API to increase the
> parallelism to 129 (which is obviously reasonable, since it is < 256), but
> the job throws an exception after restarting:
>
> !image-2024-07-30-14-59-26-976.png!
> At this time, when viewing the detailed information of the task, it is found
> that Max Parallelism has changed to 128:
>
> !image-2024-07-30-15-00-28-491.png!
>
> This can be reproduced reliably in a local environment.
>
> h3. Causes:
>
> In AdaptiveScheduler we recalculate the job's `VertexParallelismStore`,
> which results in the job having the wrong max parallelism after restart.
> This seems to be related to FLINK-21844 and FLINK-22084.
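
The numbers in the report (256 at parallelism 128, then 128 after rescaling from parallelism 1) are consistent with the default max parallelism being re-derived from the current parallelism instead of being taken from the savepoint. The following is a minimal sketch of that arithmetic, assuming the default rule is "round 1.5 x parallelism up to a power of two, clamped to [128, 32768]", which is my understanding of Flink's `KeyGroupRangeAssignment.computeDefaultMaxParallelism`. The class name `MaxParallelismSketch` and the helper `canRescaleTo` are hypothetical and only serve the illustration; this is not the scheduler's actual code.

```java
public class MaxParallelismSketch {
    // Assumed bounds of the default max parallelism (2^7 and 2^15).
    static final int LOWER_BOUND = 1 << 7;
    static final int UPPER_BOUND = 1 << 15;

    // Smallest power of two >= x, valid for 1 <= x <= 2^30.
    static int roundUpToPowerOfTwo(int x) {
        return x <= 1 ? 1 : Integer.highestOneBit(x - 1) << 1;
    }

    // Assumed default rule: round (p + p / 2) up to a power of two, then clamp.
    static int computeDefaultMaxParallelism(int parallelism) {
        int rounded = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(rounded, LOWER_BOUND), UPPER_BOUND);
    }

    // Hypothetical check: a target parallelism must not exceed max parallelism.
    static boolean canRescaleTo(int targetParallelism, int maxParallelism) {
        return targetParallelism <= maxParallelism;
    }

    public static void main(String[] args) {
        int atSubmission = computeDefaultMaxParallelism(128); // job started with p = 128
        int recomputed = computeDefaultMaxParallelism(1);     // re-derived at p = 1

        System.out.println("max parallelism derived from p=128: " + atSubmission);
        System.out.println("max parallelism re-derived from p=1: " + recomputed);
        System.out.println("rescale to 129 against savepoint value: "
                + canRescaleTo(129, atSubmission));
        System.out.println("rescale to 129 against re-derived value: "
                + canRescaleTo(129, recomputed));
    }
}
```

Under these assumptions, a rescale to 129 is accepted against the max parallelism stored in the savepoint but rejected against the value re-derived at parallelism 1, which matches the failure described above.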
--
This message was sent by Atlassian Jira
(v8.20.10#820010)