[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes

Rui Fan (Jira) Thu, 01 Feb 2024 21:42:24 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rui Fan updated FLINK-34336:
----------------------------
    Description: 
AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may 
hang in 
waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
{color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};{color}
h2. Reason:

The job has 2 tasks(vertices), after calling updateJobResourceRequirements. The 
source parallelism isn't changed (It's parallelism) , and the FlatMapper+Sink 
is changed from  parallelism to parallelism2.

So we expect the task number should be parallelism + parallelism2 instead of 
parallelism2.

 
h2. Why it can be passed for now?

Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by 
default. It means, flink job will rescale job 30 seconds after 
updateJobResourceRequirements is called.

 

So the running tasks are old parallelism when we call 
waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
{color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};. {color}

IIUC, it cannot be guaranteed, and it's unexpected.

 
h2. How to reproduce this bug?

[https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6]
 * Disable the cooldown
 * Sleep for a while before waitForRunningTasks

If so, the job running in new parallelism, so `waitForRunningTasks` will hang 
forever.

  was:
AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may 
hang in 
waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
{color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};{color}
h2. Reason:

The job has 2 tasks(vertices), after calling updateJobResourceRequirements. The 
source parallelism isn't changed (It's parallelism) , and the FlatMapper+Sink 
is changed from  parallelism to parallelism2.

So we expect the task number should be parallelism + parallelism2 instead of 
parallelism2.

 
h2. Why it can be passed for now?

Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by 
default. It means, flink job will rescale job 30 seconds after 
updateJobResourceRequirements is called.

 

So the running tasks are old parallelism when we call 
waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
{color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};. {color}

IIUC, it cannot be guaranteed, and it's unexpected.

 
h2. How to reproduce this bug?

[https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6]
 * Disable the cooldown
 * Sleep for a while before waitForRunningTasks


> AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState 
> may hang sometimes
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34336
>                 URL: https://issues.apache.org/jira/browse/FLINK-34336
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.19.0
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.19.0
>
>
> AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState 
> may hang in 
> waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
> {color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};{color}
> h2. Reason:
> The job has 2 tasks(vertices), after calling updateJobResourceRequirements. 
> The source parallelism isn't changed (It's parallelism) , and the 
> FlatMapper+Sink is changed from  parallelism to parallelism2.
> So we expect the task number should be parallelism + parallelism2 instead of 
> parallelism2.
>  
> h2. Why it can be passed for now?
> Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by 
> default. It means, flink job will rescale job 30 seconds after 
> updateJobResourceRequirements is called.
>  
> So the running tasks are old parallelism when we call 
> waitForRunningTasks({color:#9876aa}restClusterClient{color}{color:#cc7832}, 
> {color}jobID{color:#cc7832}, {color}parallelism2){color:#cc7832};. {color}
> IIUC, it cannot be guaranteed, and it's unexpected.
>  
> h2. How to reproduce this bug?
> [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6]
>  * Disable the cooldown
>  * Sleep for a while before waitForRunningTasks
> If so, the job running in new parallelism, so `waitForRunningTasks` will hang 
> forever.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes

Reply via email to