[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330479#comment-17330479
 ] 

Chesnay Schepler commented on FLINK-22406:
------------------------------------------

I managed to get the debug logs for a failed run, and they confirm my theory.
The test starts out with 2 TMs, each with 2 slots. In total the test runs 3
jobs: the first with p=2 after the first 2 slots arrive, the second with p=4
after the next 2 slots arrive, and a final one with p=2 again after we shut
down a TM.

The problem is caused by the first job, coupled with how the test counts
active subtasks.

The first job is effectively canceled immediately after being deployed. As a
result, the cancellation message can be processed before the task deployment.
The cancellation then fails because nothing is deployed (yet), and the JM
marks the task as canceled and proceeds with a restart. The task deployment
arrives later and is processed as usual. This task sticks around until its
initialization is complete and it transitions into a running state, at which
point the corresponding state update is rejected by the JM.
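
To make the message reordering concrete, here is a minimal simulation of the
sequence described above. It does not use any Flink classes;
FakeTaskManager, FakeJobManager, JmState and the task id are hypothetical
stand-ins invented for illustration, and the real TM/JM logic is considerably
more involved.

```java
import java.util.HashMap;
import java.util.Map;

class CancelBeforeDeployRace {
    enum JmState { DEPLOYING, CANCELED }

    // Hypothetical TaskManager stand-in: only knows the tasks it has received.
    static class FakeTaskManager {
        final Map<String, String> tasks = new HashMap<>();

        // Cancellation fails (returns false) if the task was never deployed here.
        boolean cancel(String taskId) {
            return tasks.remove(taskId) != null;
        }

        void deploy(String taskId) {
            tasks.put(taskId, "INITIALIZING");
        }
    }

    // Hypothetical JobManager stand-in: tracks its own view of each task.
    static class FakeJobManager {
        final Map<String, JmState> view = new HashMap<>();

        // A RUNNING update is only accepted for tasks the JM still expects.
        boolean acceptRunningUpdate(String taskId) {
            return view.get(taskId) == JmState.DEPLOYING;
        }
    }

    // Replays the reordered messages and returns how many tasks the TM
    // still hosts although the JM already considers them canceled.
    static int staleTasksAfterReorderedMessages() {
        FakeTaskManager tm = new FakeTaskManager();
        FakeJobManager jm = new FakeJobManager();
        String task = "job1-subtask0";

        jm.view.put(task, JmState.DEPLOYING); // JM sends deploy, then cancel
        tm.cancel(task);                      // cancel overtakes deploy and fails
        jm.view.put(task, JmState.CANCELED);  // JM marks the task canceled anyway
        tm.deploy(task);                      // late deployment is processed as usual

        int stale = 0;
        for (String id : tm.tasks.keySet()) {
            if (jm.view.get(id) == JmState.CANCELED) {
                stale++;
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        System.out.println("stale tasks on TM: " + staleTasksAfterReorderedMessages());
    }
}
```

In this sketch, anything that counts tasks on the TM side sees one more task
than the JM believes exists, which is the kind of mismatch a subtask-counting
test would be sensitive to.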

After an offline discussion with [~trohrmann], we concluded that we could fix
this by not checking what is actually being deployed, and instead polling the
JM for the current parallelism.
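
A rough sketch of what such polling could look like in a test is shown below.
This is not the actual fix: ParallelismPoller and waitForParallelism are
hypothetical names, and the IntSupplier stands in for whatever call the test
would use to ask the JM for the job's current parallelism.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

class ParallelismPoller {
    /**
     * Repeatedly queries the supplier until it reports the expected
     * parallelism. Returns true on success, false if the timeout elapses
     * or the waiting thread is interrupted.
     */
    static boolean waitForParallelism(
            IntSupplier currentParallelism,
            int expected,
            Duration timeout,
            Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (currentParallelism.getAsInt() == expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Because the check only depends on the JM's view, a leftover task on a TM that
the JM has already written off would not influence the result.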

> Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()
> -----------------------------------------------------------------
>
>                 Key: FLINK-22406
>                 URL: https://issues.apache.org/jira/browse/FLINK-22406
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.13.0
>            Reporter: Stephan Ewen
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>
> The test is stalling on Azure CI.
> https://dev.azure.com/sewen0794/Flink/_build/results?buildId=292&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87&l=4865



