[ 
https://issues.apache.org/jira/browse/FLINK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410895#comment-17410895
 ] 

Qingsheng Ren commented on FLINK-24174:
---------------------------------------

Thanks [~Jiangang] for reporting this. I think this is the same issue as 
FLINK-23807. Instead of using metrics as described in that ticket, we've 
decided to use RestClient to detect TaskManager failure, which is more stable. 

> MiniClusterTestEnvironment's triggerTaskManagerFailover may get stuck in 
> CommonTestUtils.waitForJobStatus()
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24174
>                 URL: https://issues.apache.org/jira/browse/FLINK-24174
>             Project: Flink
>          Issue Type: Improvement
>          Components: Test Infrastructure
>            Reporter: Liu
>            Priority: Major
>
> When writing TaskManager failover tests with the [unified testing framework for 
> connectors|https://issues.apache.org/jira/browse/FLINK-19554], I found that 
> triggerTaskManagerFailover may get stuck in 
> CommonTestUtils.waitForJobStatus() as follows:
>  # triggerTaskManagerFailover is called.
>  # JobStatus switched from RUNNING to RESTARTING.
>  # JobStatus switched from RESTARTING to RUNNING.
>  # The method terminateTaskManager() is completed.
>  # Since the job status is RUNNING again by then, 
> CommonTestUtils.waitForJobStatus() never sees a failure status and will 
> never exit.
> One solution is to call terminateTaskManager() asynchronously and, at the 
> same time, call CommonTestUtils.waitForJobStatus(). The pseudocode could 
> look as follows:
> {code:java}
> public void triggerTaskManagerFailover(JobClient jobClient, Runnable afterFailAction)
>         throws Exception {
>     // Proposed change: terminate the TaskManager asynchronously.
>     CompletableFuture<Void> completableFuture = terminateTaskManager();
>     // Wait for the failover status while the termination is still in progress.
>     CommonTestUtils.waitForJobStatus(
>             jobClient,
>             Arrays.asList(JobStatus.FAILING, JobStatus.FAILED, JobStatus.RESTARTING),
>             Deadline.fromNow(Duration.ofMinutes(5)));
>     // Make sure the termination has finished before continuing.
>     completableFuture.get();
>     afterFailAction.run();
>     startTaskManager();
> }
> {code}
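
The race described in the quoted steps can be reproduced without Flink. The following is a minimal, self-contained sketch under stated assumptions: the class FailoverRaceSketch, its Status enum, and the methods terminateTaskManager() and waitForStatus() are hypothetical stand-ins, not Flink APIs, and the 500 ms restart window is arbitrary. It only illustrates why polling after a blocking termination can miss a transient RESTARTING status, while polling concurrently with an asynchronous termination can observe it.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class FailoverRaceSketch {
    // Hypothetical, simplified job status; not Flink's JobStatus.
    enum Status { RUNNING, RESTARTING }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.RUNNING);

    // Stand-in for a blocking termination: the job restarts and recovers
    // before this method returns (steps 2 to 4 of the quoted description).
    void terminateTaskManager() {
        status.set(Status.RESTARTING);
        sleep(500); // assumed length of the transient restart window
        status.set(Status.RUNNING);
    }

    // Polls until the status is one of the expected values or the timeout expires.
    boolean waitForStatus(Set<Status> expected, long timeoutMillis) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < timeoutMillis * 1_000_000L) {
            if (expected.contains(status.get())) {
                return true;
            }
            sleep(10);
        }
        return false;
    }

    // Terminate first, then wait: RESTARTING is already gone when polling starts.
    public static boolean sequentialObservesRestart() {
        FailoverRaceSketch env = new FailoverRaceSketch();
        env.terminateTaskManager();
        return env.waitForStatus(EnumSet.of(Status.RESTARTING), 300);
    }

    // Terminate asynchronously and poll at the same time.
    public static boolean concurrentObservesRestart() {
        FailoverRaceSketch env = new FailoverRaceSketch();
        CompletableFuture<Void> termination =
                CompletableFuture.runAsync(env::terminateTaskManager);
        boolean observed = env.waitForStatus(EnumSet.of(Status.RESTARTING), 2_000);
        termination.join(); // ensure the termination has completed
        return observed;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential observed restart: " + sequentialObservesRestart());
        System.out.println("concurrent observed restart: " + concurrentObservesRestart());
    }
}
```

In this sketch the sequential variant is expected to time out and the concurrent variant is expected to observe the restart. The concurrent variant still depends on polling fast enough to catch the window, which may be one reason a different detection mechanism is preferable.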



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
