[jira] [Commented] (FLINK-38687) Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling

RocMarshal (Jira) Sun, 23 Nov 2025 20:10:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040216#comment-18040216
 ]


RocMarshal commented on FLINK-38687:
------------------------------------

Thank you [~fcsaky] very much for your hard work!

>  I believe we should have a better UI representation of this somehow.

We may need a new FLIP that adds pages or tips showing the related 
statistics.


Regarding the unexpected results with the Default Scheduler:

I tried to reproduce the issue above on a PC using [~fcsaky]'s deployment 
and testing procedure, but I was unable to reproduce the same scenario. 
This does not cast doubt on [~fcsaky]'s test results; rather, it indicates 
that such unexpected outcomes occur with some randomness under the current 
test case configuration.



This test result should most likely be considered a non-blocking issue for 
the release, mainly for the following reasons:

a: The feature aims to ensure balanced task scheduling from the resource 
perspective of the job. During failover scenarios, when resources are released 
and resource requests are processed, slight discrepancies in allocation results 
may occur due to delayed updates in the resource view, leading to suboptimal 
task distribution balance. In such cases, we can improve the situation by 
appropriately increasing the value of slot.request.max-interval.

b: Based on the test results, although imbalance may occur, it does not affect 
the smoothness of the job scheduling process or the successful execution of the 
job.
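
To make "imbalance" in points a and b concrete, here is a minimal Python 
sketch (not part of Flink or of the testing jobs) that counts tasks per 
TaskManager from a hand-collected list and reports the gap between the most 
and least loaded TaskManager. The input format, the function names, and the 
TaskManager IDs are assumptions made purely for illustration.

```python
from collections import Counter


def tasks_per_taskmanager(assignments, all_taskmanagers=()):
    """Count subtasks per TaskManager.

    assignments: hypothetical list of (subtask_name, taskmanager_id) pairs,
    for example copied by hand from the Flink Web UI.
    all_taskmanagers: optional IDs of every TaskManager in the cluster, so
    that TaskManagers holding no task are counted as 0.
    """
    counts = Counter({tm: 0 for tm in all_taskmanagers})
    counts.update(tm for _, tm in assignments)
    return dict(counts)


def spread(counts):
    """Return max minus min task count; 0 means perfectly balanced."""
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())


# Illustrative data only: two TaskManagers, six subtasks.
balanced = [("source-0", "tm-1"), ("sink-0", "tm-1"), ("sink-1", "tm-1"),
            ("source-1", "tm-2"), ("sink-2", "tm-2"), ("sink-3", "tm-2")]
skewed = [("source-0", "tm-1"), ("sink-0", "tm-1"), ("sink-1", "tm-1"),
          ("source-1", "tm-1"), ("sink-2", "tm-2"), ("sink-3", "tm-2")]

print(spread(tasks_per_taskmanager(balanced)))  # 0
print(spread(tasks_per_taskmanager(skewed)))    # 2
```

A spread of 0 corresponds to the balanced result the test cases ask for; a 
larger value is the kind of deviation discussed above.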



Note, however, that if the imbalance persists even after the value of 
slot.request.max-interval has been increased sufficiently, it should be 
recorded in the current Jira ticket, or a new Jira ticket should be created 
to report the bug. The report should describe the scheduling-related 
configuration and the observed behavior.
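
When describing the configuration in such a report, it can help to state the 
balanced result that the setup should produce. The Python sketch below 
derives it for the jobs in the testing guide. It rests on assumptions that 
are not statements from the guide: each operator runs as its own task, a slot 
sharing group needs as many slots as its highest operator parallelism, and 
the cluster holds exactly the TaskManagers needed for those slots. The 
function name is hypothetical.

```python
import math


def expected_layout(groups, slots_per_tm):
    """Return (slots, taskmanagers, tasks_per_taskmanager).

    groups: one list of operator parallelisms per slot sharing group.
    slots_per_tm: the value of taskmanager.numberOfTaskSlots.
    """
    # Assumption: with slot sharing, a group needs as many slots as its
    # highest operator parallelism.
    slots = sum(max(group) for group in groups)
    # Assumption: every operator subtask is a separate task.
    tasks = sum(sum(group) for group in groups)
    # Assumption: the cluster has exactly enough TaskManagers for the slots.
    taskmanagers = math.ceil(slots / slots_per_tm)
    return slots, taskmanagers, tasks / taskmanagers


# Single slot sharing group: source parallelism 10, sink parallelism 20.
print(expected_layout([[10, 20]], 2))            # (20, 10, 3.0)
# Two slot sharing groups with the same operators.
print(expected_layout([[10, 20], [10, 20]], 2))  # (40, 20, 3.0)
```

Under these assumptions both jobs come out at 3 tasks per TaskManager, which 
matches the balanced result that the verification steps check for.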

CC [~ruanhang1993] [~fcsaky] 

> Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling
> -------------------------------------------------------------------
>
>                 Key: FLINK-38687
>                 URL: https://issues.apache.org/jira/browse/FLINK-38687
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: RocMarshal
>            Assignee: Ferenc Csaky
>            Priority: Major
>         Attachments: Screenshot 2025-11-22 at 9.17.03.png, Screenshot 
> 2025-11-22 at 9.17.27.png, Screenshot 2025-11-22 at 9.17.54.png
>
>
> The original testing guide doc is here: 
> [https://docs.google.com/document/d/1ZXSwtwGeSxy8L2AHdpTnumhXNWWisho_a8dcxRYSvsk/edit?tab=t.0#heading=h.1vcje3u1wogz]
> Its content is as follows:
> h1. 1 Motivation
> This document introduces the core working principles of the functionality 
> added by FLIP-370, as well as the key test cases that cross-team testing 
> should focus on to verify the correctness of the feature.
> h1. 2 Background: the core logic of balanced scheduling
> Please refer to this 
> [page|https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/tasks-scheduling/balanced_tasks_scheduling/].
>  
> h1. 3 Constructing and validating test cases
> As stated in the [FLIP|https://cwiki.apache.org/confluence/x/U56zDw] 
> document, balanced task scheduling works from the perspective of the 
> JobMaster's SlotPool and balances the tasks of a single job. Therefore, all 
> test cases here can be verified in application execution mode, regardless 
> of whether the resources come from YARN or Kubernetes.
> Testing jobs: [https://github.com/RocMarshal/flip370-testing-jobs] 
>  
> h2. 3.1 Test for a job that contains a slot sharing group
> h3. 3.1.1 Regular job test
>  * Test case code
>  * Entry-point class: {_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * Verification results
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** Obtain the TaskManager on which each task is located through the 
> following steps:
>  **** !image-2025-11-17-11-33-45-545.png!
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> h3. 3.1.2 Failover scenario test
> h4. 3.1.2.1 Failover scenario test triggered by tasks
>  * Test case code
>  ** Entry-point class: {_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  ** Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  ** Job-level startup parameters are as follows: pass 300000 as a parameter 
> to the Flink job entry class, which indicates that the 0th subtask of the 
> source operator will throw a task exception every 5 minutes to trigger a job 
> failover:  _flip370.testing.slotsharinggroup.single.FlinkTestingJob 300000_
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * Wait for the task exception to trigger a failover.
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> h4. 3.1.2.2 Failover scenario test triggered by TaskManagers
>  * Test case code
>  * Entry-point class: {_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * How to simulate TaskManager-level failures?
>  ** Manually kill any one or more TaskManager instances/containers in the job 
> cluster.
>  * Wait for the failover to complete.
>  * {color:#de350b}*Verification results*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
> h2. 3.2 Test for a job that contains multiple slot sharing groups
> h3. 3.2.1 Regular job test
>  * Test case code
>  * Entry-point class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> h3. 3.2.2 Failover scenario test
> h4. 3.2.2.1 Failover scenario test triggered by tasks
>  * Test case code
>  * Entry-point class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters are as follows: pass 300000 as a parameter to 
> the Flink job entry class, which indicates that the 0th subtask of the source 
> operator will throw a task exception every 5 minutes to trigger a job 
> failover:  _flip370.testing.slotsharinggroup.multiple.FlinkTestingJob 300000_
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * Wait for the task exception to trigger a failover.
>  * {color:#de350b}*Verification results*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
> h4. 3.2.2.2 Failover scenario test triggered by TaskManagers
>  * Test case code
>  * Entry-point class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  ** _taskmanager.load-balance.mode: TASKS_
>  ** _taskmanager.numberOfTaskSlots: 2_
>  ** _restart-strategy.fixed-delay.attempts: 32_
>  ** _restart-strategy.fixed-delay.delay: 10s_
>  ** _jobmanager.scheduler: Adaptive/Default_
>  * Submit the flink job.
>  * How to simulate TaskManager-level failures?
>  ** Manually kill any one or more TaskManager instances/containers in the job 
> cluster.
>  * Wait for the failover to complete.
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> Ping [~RocMarshal] if there are any issues during the testing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
