[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327344#comment-17327344
 ] 

Chesnay Schepler commented on FLINK-22406:
------------------------------------------

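There are a few odd things in the logs. It seems like the JM is prematurely 
moving tasks into a canceled state: going by the timestamps, the execution 
below reaches CANCELED on the JM before the TaskExecutor has even received 
the task.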
{code:java}
23:47:20,274 INFO  o.a.f.r.executiongraph.ExecutionGraph       [] - Source: 
Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) 
switched from CREATED to DEPLOYING.
23:47:20,274 INFO  o.a.f.r.executiongraph.ExecutionGraph       [] - Source: 
Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) 
switched from DEPLOYING to CANCELING.
23:47:20,277 INFO  o.a.f.r.executiongraph.ExecutionGraph       [] - Source: 
Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) 
switched from CANCELING to CANCELED.
23:47:20,282 INFO  o.a.f.r.taskexecutor.TaskExecutor           [] - Received 
task Source: Custom Source -> Sink: Unnamed (2/2)#0 
(0d91787b2ba65cd0f259be619b293b96), deploy into slot with allocation id 
48a192cd6be4f34599cac87ad5d8caba.
23:47:20,287 INFO  o.a.f.r.taskmanager.Task                    [] - Source: 
Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) 
switched from CREATED to DEPLOYING.
23:47:20,296 INFO  o.a.f.r.taskmanager.Task                    [] - Source: 
Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) 
switched from DEPLOYING to INITIALIZING.
23:47:20,327 WARN  o.a.f.r.taskmanager.Task                    [] - Source: 
Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) 
switched from INITIALIZING to FAILED with failure cause: 
org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution 
attempt 0d91787b2ba65cd0f259be619b293b96 was not found.
{code}
This doesn't necessarily explain the issue, but with a stray task hanging 
around for longer than we expect, it is possible that the number of active 
instances is still 3 after the downscaling has concluded. If the test thread 
enters the waiting loop at that point it will never exit, because we don't 
notify the thread when instances shut down. This is entirely theoretical, but 
it is the only explanation I can come up with for the test getting stuck.
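
To make the suspected race concrete, here is a minimal sketch of it. This is 
not the actual ReactiveModeITCase code: InstanceTracker, the notifyOnShutdown 
flag and waiterGetsStuck are hypothetical names used only for illustration, 
and I am assuming the test waits on a monitor until the instance count matches 
the expected value. The waiter enters the loop while 3 instances are still 
active; the stray one then shuts down, and the waiter only wakes up if the 
shutdown path also notifies.

```java
import java.util.concurrent.CountDownLatch;

final class StrayInstanceSketch {

    /** Hypothetical tracker; the real test's bookkeeping may look different. */
    static final class InstanceTracker {
        private final Object lock = new Object();
        private final boolean notifyOnShutdown;
        private int activeInstances;

        InstanceTracker(boolean notifyOnShutdown) {
            this.notifyOnShutdown = notifyOnShutdown;
        }

        void instanceStarted() {
            synchronized (lock) {
                activeInstances++;
                lock.notifyAll();
            }
        }

        void instanceStopped() {
            synchronized (lock) {
                activeInstances--;
                // Without this notification a waiting thread never re-checks the count.
                if (notifyOnShutdown) {
                    lock.notifyAll();
                }
            }
        }

        void awaitInstances(int expected, CountDownLatch enteredLoop)
                throws InterruptedException {
            synchronized (lock) {
                // Signal while holding the lock: nobody can change the count
                // until this thread releases the lock inside wait().
                enteredLoop.countDown();
                while (activeInstances != expected) {
                    lock.wait();
                }
            }
        }
    }

    /** Returns true if the waiter is still blocked after the stray instance stopped. */
    static boolean waiterGetsStuck(boolean notifyOnShutdown) {
        InstanceTracker tracker = new InstanceTracker(notifyOnShutdown);
        for (int i = 0; i < 3; i++) {
            tracker.instanceStarted(); // two expected instances plus one stray one
        }

        CountDownLatch enteredLoop = new CountDownLatch(1);
        Thread waiter = new Thread(() -> {
            try {
                tracker.awaitInstances(2, enteredLoop);
            } catch (InterruptedException ignored) {
                // sketch only: just let the thread end
            }
        });
        waiter.setDaemon(true); // a stuck waiter must not keep the JVM alive
        waiter.start();

        try {
            enteredLoop.await();       // waiter has seen 3 instances
            tracker.instanceStopped(); // stray instance finally goes away
            waiter.join(500);          // give the waiter time to wake up
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return waiter.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("stuck without shutdown notification: " + waiterGetsStuck(false));
        System.out.println("stuck with shutdown notification: " + waiterGetsStuck(true));
    }
}
```

Barring a spurious wakeup, the first call should leave the waiter blocked and 
the second should not. If this theory holds, notifying on shutdown as well (or 
waiting with a timeout and re-checking the count) would be enough to keep the 
test from hanging.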

> Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()
> -----------------------------------------------------------------
>
>                 Key: FLINK-22406
>                 URL: https://issues.apache.org/jira/browse/FLINK-22406
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.13.0
>            Reporter: Stephan Ewen
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>
> The test is stalling on Azure CI.
> https://dev.azure.com/sewen0794/Flink/_build/results?buildId=292&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87&l=4865



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
