1996fanrui commented on code in PR #24340:
URL: https://github.com/apache/flink/pull/24340#discussion_r1495400609


##########
flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java:
##########
@@ -341,7 +343,7 @@ public void 
testCheckpointRescalingNonPartitionedStateCausesException() throws E
 
             restClusterClient.updateJobResourceRequirements(jobID, 
builder.build()).join();
 
-            waitForRunningTasks(restClusterClient, jobID, parallelism2);
+            waitForRunningTasks(restClusterClient, jobID, 2 * parallelism2);

Review Comment:
   `parallelism2` is the task number for before rescale, and `2 * parallelism2` 
is the task number for after rescale. Ideally, we should wait for all tasks 
after rescale to become running. 
   
   That's exactly what FLINK-34336 fix.
   
   > Why does it increase the hang possibility?
   
   When the rescale with cooldown=30s, the rescale action will be executed 
after 30s. When the `waitForRunningTasks` is called, the job is still running 
with old parallelism, so the `waitForRunningTasks` can be done. 
   
   When we disable the scaling cooldown, the rescale may happen immediately. 
When it happens, the job will be running with `2 * parallelism2`, so 
`waitForRunningTasks` with old parallelism2 will be hang forever.
   
   > And why did we merge it into master?
   
   IIUC, we should wait for all tasks after rescale to become running.  But 
these 2 callers wait for all tasks before rescale. These 2 wait are unexpected, 
so I think it's a separate bug regardless of we disable/enable cooldown. 
   
   It means we should wait for  all tasks after rescale even if we enable 
cooldown=30s. That's why I create a separate JIRA (FLINK-34336) to follow it.
   
   Why do I merge it into master? 
   
   1. I think `Disable the scaling cooldown to speed up the test` and 
FLINK-34336 are separate
   2. I guess `Disable the scaling cooldown to speed up the test` may be cause 
AutoRescalingITCase hang, so I prepare 
https://github.com/apache/flink/pull/24248 in advanced. It's easy to be merged 
to fix the hang.
   
   
   > Why do we merge it then now?
   
   1. `Disable the scaling cooldown to speed up the test` is a minor 
improvement for test instead of bugfix. So it's not necessary for 1.19 IIUC.
   2. 1.19 will be released soon, I'm afraid to affect this release. So I merge 
it in master(1.20) first.
   3. After FLINK-34226 is fixed, this minor improvement doesn't have any 
negative impact(I didn't notice any), so merging it may be make sense. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to