tillrohrmann commented on pull request #16342: URL: https://github.com/apache/flink/pull/16342#issuecomment-872891379
Thanks for the review @zentol. I have addressed your comments and hardened the failing test cases. The problem of the `ResourceManagerTaskExecutorTest.testDelayedRegisterTaskExecutor` was that I changed how the `TimeoutException` message is generated in the `AkkaInvocationHandler#resolveTimeoutException` which missed the class name of the RPC. I've changed this by adding the declaring class name to the `RpcInvocation` message and their `toString` method. I think this is a good improvement because it makes it clearer which class one is calling a RPC on. The problem of the `TaskManagerProcessFailureBatchRecoveryITCase#AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure` is actually quite interesting. The test checks that a job gets recovered if a TM dies. The problem was that it was build on some very specific Flink behaviour to realize this. The test assumed that a recovery takes at least as long as the `akka.ask.timeout` because cancelling the tasks on the failed TM won't complete until this timeout expired. Moreover, it assumes that a TM will only be marked as dead if the hearbeat timeout expires. Therefore, by configuring the `akka.ask.timeout` to `100 s` and setting the `heartbeat.timeout` to `10 s`, it knew that there will only be 2 restart attempts because the second attempt will only be started after the `heartbeat.timeout` has been triggered. With my change, this behaviour changes because the cancel call won't block until the `akka.ask.timeout`. Instead it completes (or rather fails) right away, because it sees that the remote endpoint is no longer reachable. Therefore (and because we have set the delay between restarts to `0`), we will start right away with the next execution attempt. Since the dead TM has not been removed from the slot pool, we will reuse the slots for the second attempt and fail right away again when deploying the tasks. Since the TM will only be marked as dead after `10 s`, we will deplete our restart attempts and then fail the job. The solution to fixing this problem is to set the number of restart attempts to `Integer.MAX_VALUE` and to configure a delay between restarts of `1 s` so that we don't have a busy loop. With FLINK-23209, this will improve considerably, because then it will only take the `heartbeat.interval` to detect the dead TM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
