tillrohrmann commented on issue #8412: [FLINK-12111][tests] Harden AbstractTaskManagerProcessFailureRecoveryTest
URL: https://github.com/apache/flink/pull/8412#issuecomment-492549029

After an offline discussion with @zentol, we concluded that the following is most likely happening:

Killing the first TM process `TM1` does not happen synchronously. Therefore, `TM1` can still see the proceed marker file and finish the computation before it dies. Next, the reducer is started and consumes the data from `TM1`. Now `TM1` is killed and the network stack signals the connection loss. Since the reducer runs with a parallelism of `1` and is deployed on `TM2`, the `ExecutionGraph` can be restarted quickly without waiting for the heartbeat to time out (the `rpcTimeout` is set to `100 s`, and the cancel attempts won't fail). Restarting the `ExecutionGraph` deploys some tasks to `TM1`, whose heartbeat has not yet timed out. Finally, the heartbeat of `TM1` times out and the job fails a second time.

The proposed solution is the following:

* Wait for `TM1` to be terminated before creating the proceed marker file. That way the mappers running on `TM1` should never complete.
* Set the number of allowed restarts to `2` in order to allow for two job restarts.
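The ordering in the first bullet can be sketched as below. This is a minimal illustration using only the JDK, not the code from the PR: the class name `KillThenProceed`, the method `killAndSignalProceed`, its parameters, and the use of `destroyForcibly()` are all assumptions made for the example. The point is that the marker file is created only after the kill is confirmed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the Flink test code.
public class KillThenProceed {

    /**
     * Kills the given process, blocks until it has actually exited, and only
     * then creates the proceed marker file.
     *
     * Returns false (and writes no marker) if the process did not terminate
     * within the timeout.
     */
    static boolean killAndSignalProceed(Process tm1, Path markerFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Requesting termination is asynchronous: the process may keep
        // running for a short while after this call returns.
        tm1.destroyForcibly();

        // Block until the process is really gone before letting the job proceed.
        if (!tm1.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return false;
        }

        // Safe now: a dead TM1 can no longer observe the marker and finish its mappers.
        Files.createFile(markerFile);
        return true;
    }
}
```

The second bullet (allowing two restarts) is a job configuration change and is not shown here.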
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
