tillrohrmann commented on pull request #16342:
URL: https://github.com/apache/flink/pull/16342#issuecomment-872891379


   Thanks for the review @zentol. I have addressed your comments and hardened 
the failing test cases.
   
   The problem of the 
`ResourceManagerTaskExecutorTest.testDelayedRegisterTaskExecutor` was that I 
changed how the `TimeoutException` message is generated in the 
`AkkaInvocationHandler#resolveTimeoutException` which missed the class name of 
the RPC. I've changed this by adding the declaring class name to the 
`RpcInvocation` message and their `toString` method. I think this is a good 
improvement because it makes it clearer which class one is calling a RPC on.
   
   The problem of the 
`TaskManagerProcessFailureBatchRecoveryITCase#AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure`
 is actually quite interesting. The test checks that a job gets recovered if a 
TM dies. The problem was that it was build on some very specific Flink 
behaviour to realize this. The test assumed that a recovery takes at least as 
long as the `akka.ask.timeout` because cancelling the tasks on the failed TM 
won't complete until this timeout expired. Moreover, it assumes that a TM will 
only be marked as dead if the hearbeat timeout expires. Therefore, by 
configuring the `akka.ask.timeout` to `100 s` and setting the 
`heartbeat.timeout` to `10 s`, it knew that there will only be 2 restart 
attempts because the second attempt will only be started after the 
`heartbeat.timeout` has been triggered.
   
   With my change, this behaviour changes because the cancel call won't block 
until the `akka.ask.timeout`. Instead it completes (or rather fails) right 
away, because it sees that the remote endpoint is no longer reachable. 
Therefore (and because we have set the delay between restarts to `0`), we will 
start right away with the next execution attempt. Since the dead TM has not 
been removed from the slot pool, we will reuse the slots for the second attempt 
and fail right away again when deploying the tasks. Since the TM will only be 
marked as dead after `10 s`, we will deplete our restart attempts and then fail 
the job.
   
   The solution to fixing this problem is to set the number of restart attempts 
to `Integer.MAX_VALUE` and to configure a delay between restarts of `1 s` so 
that we don't have a busy loop. With FLINK-23209, this will improve 
considerably, because then it will only take the `heartbeat.interval` to detect 
the dead TM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to