zentol commented on a change in pull request #17606:
URL: https://github.com/apache/flink/pull/17606#discussion_r740957930
##########
File path:
flink-tests/src/test/java/org/apache/flink/test/recovery/TaskManagerProcessFailureBatchRecoveryITCase.java
##########
@@ -67,7 +67,7 @@ public void testTaskManagerFailure(Configuration configuration, final File coord
ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment("localhost", 1337, configuration);
env.setParallelism(PARALLELISM);
- env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1, 1500L));
+ env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 1500L));
Review comment:
It looks like failed heartbeats are ignored while we wait out the restart
delay; the killed TM is only reported as unreachable after the job has
switched back to RUNNING:
```
// 64589-0ef458 is the killed TM
// this is the last heartbeat
17532 o.a.f.r.jobmaster.JobMaster [] - Received heartbeat from 127.0.0.1:64589-0ef458.
...
<start job restart>
<cancel slot requests>
<cleanup partitions>
<various failed cancelTask RPCs>
17740 <reduce resource requirements to 0>
...
17440... JM idling, sending heartbeat requests
19212 o.a.f.r.jobmaster.JobMaster [] - Archive local failure causing attempt 05bcf9159a5a301d2f7b6566111235da to fail
...
19213 o.a.f.r.executiongraph.ExecutionGraph [] - Job Flink Java Job at Mon Nov 01 15:42:30 CET 2021 (4daf5dcbf65f7cd384ac228ad72ab5c6) switched from state RESTARTING to RUNNING.
19777 o.a.f.r.jobmaster.JobMaster [] - TaskManager with id 127.0.0.1:64589-0ef458 is no longer reachable.
19777 o.a.f.r.jobmaster.JobMaster [] - Disconnect TaskExecutor 127.0.0.1:64589-0ef458 because: TaskManager with id 127.0.0.1:64589-0ef458 is no longer reachable.
```
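To make the ordering in the log easier to see, here is a rough sketch of the arithmetic on the leading numbers of three log lines. It assumes those numbers are milliseconds, which the log itself does not state, and the class and method names are hypothetical, not part of Flink.

```java
public class RestartTimeline {
    // Leading numbers copied from the log excerpt.
    // Assumption: they are milliseconds since some common start point.
    static final long LAST_HEARTBEAT = 17532L;      // last heartbeat received from the killed TM
    static final long SWITCHED_TO_RUNNING = 19213L; // job switched from RESTARTING to RUNNING
    static final long TM_UNREACHABLE = 19777L;      // JM reports the killed TM as unreachable

    // Elapsed time between two log timestamps.
    static long elapsed(long from, long to) {
        return to - from;
    }

    public static void main(String[] args) {
        System.out.println("last heartbeat -> RUNNING: "
                + elapsed(LAST_HEARTBEAT, SWITCHED_TO_RUNNING));
        System.out.println("RUNNING -> TM unreachable: "
                + elapsed(SWITCHED_TO_RUNNING, TM_UNREACHABLE));
    }
}
```

Under that assumption the job is RUNNING again 1681 after the last heartbeat, and the dead TM is only disconnected 564 later, i.e. after the restart has already happened.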