Arvid Heise created FLINK-23807:
-----------------------------------

             Summary: Use metrics to detect restarts in MiniClusterTestEnvironment#triggerTaskManagerFailover
                 Key: FLINK-23807
                 URL: https://issues.apache.org/jira/browse/FLINK-23807
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Common
            Reporter: Arvid Heise
             Fix For: 1.14.0


{{MiniClusterTestEnvironment#triggerTaskManagerFailover}} checks the job status 
to detect a restart:
{noformat}
        terminateTaskManager();
        CommonTestUtils.waitForJobStatus(
                jobClient,
                Arrays.asList(JobStatus.FAILING, JobStatus.FAILED, JobStatus.RESTARTING),
                Deadline.fromNow(Duration.ofMinutes(5)));
        afterFailAction.run();
        startTaskManager();
{noformat}
However, `waitForJobStatus` polls every 100ms, while a restart can complete 
within 10ms. The poll can therefore easily miss the actual restart and wait 
forever (or until the next restart happens because slots are missing).

We should rather use the metric `numRestarts`: read its value before inducing 
the error, and then wait until the counter has increased.
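
To illustrate why a counter is more robust than a status poll, here is a minimal, self-contained sketch. It does not use any Flink API: the class `RestartCounterWait`, the helper `waitForIncrease`, and the `AtomicLong` standing in for the `numRestarts` metric are all hypothetical names chosen for this example, and how the real metric would be read in the test environment is left open. The point is only that a monotonically increasing counter still shows the restart after the fact, even if the restart finished between two polls.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class RestartCounterWait {

    /**
     * Polls the counter until it exceeds the value captured before the
     * induced failure, or throws once the timeout has elapsed.
     * Hypothetical helper, not part of Flink.
     */
    static long waitForIncrease(LongSupplier counter, long before, Duration timeout)
            throws InterruptedException, TimeoutException {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            long current = counter.getAsLong();
            if (current > before) {
                return current; // at least one restart happened since 'before'
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException("counter did not increase beyond " + before);
            }
            Thread.sleep(100); // same poll interval as the status-based check
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the job's numRestarts metric (assumption for this sketch).
        AtomicLong numRestarts = new AtomicLong();

        long before = numRestarts.get(); // 1. read the counter before the induced error
        numRestarts.incrementAndGet();   // 2. simulated restart, finished before any poll
        long after = waitForIncrease(numRestarts::get, before, Duration.ofSeconds(5));

        System.out.println("restarts observed: " + (after - before));
    }
}
```

Unlike the transient FAILING/RESTARTING states, the counter never goes back down, so the poll interval no longer has to be shorter than the restart itself.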



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
