tillrohrmann commented on issue #8412: [FLINK-12111][tests] Harden 
AbstractTaskManagerProcessFailureRecoveryTest
URL: https://github.com/apache/flink/pull/8412#issuecomment-492549029
 
 
   After an offline discussion with @zentol, we concluded that the following is 
most likely happening:
   
   Killing the first TM process `TM1` does not happen synchronously. Therefore, 
`TM1` may see the proceed marker file and finish its computation before it is 
actually killed. Next, the reducer is started and consumes the data from `TM1`. 
Now `TM1` is killed and the network stack signals the connection loss. Since 
the reducer is running with a parallelism of `1` and is deployed on `TM2`, the 
`ExecutionGraph` can be restarted quickly without waiting for the heartbeat to 
time out (because the `rpcTimeout` is set to `100 s` and the cancel attempts 
won't fail). Restarting the `ExecutionGraph` results in some tasks being 
deployed to `TM1`, whose heartbeat has not yet timed out. Finally, the 
heartbeat of `TM1` times out and the job fails a second time.
   
   The proposed solution is the following:
   
   * Wait for `TM1` to be terminated before creating the proceed marker file. 
That way the mappers running on `TM1` should never complete.
   * Set the number of allowed restarts to `2` in order to allow for two job 
restarts.
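
The first point can be illustrated with a minimal sketch in plain Java. It is not the code from the PR: the class name `KillThenProceed`, the helper `killAndCreateMarker`, and the `sleep` stand-in process are hypothetical and only show the ordering "kill, wait for termination, then create the marker file".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

public class KillThenProceed {

    /**
     * Hypothetical helper: forcibly kills the given process, blocks until it
     * has actually exited, and only then creates the marker file. Returns
     * false (and creates no marker) if the process is still alive after the
     * timeout.
     */
    static boolean killAndCreateMarker(Process taskManager, Path markerFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        // destroyForcibly() only requests termination; it does not wait for it.
        taskManager.destroyForcibly();

        // Block until the process is really gone, which closes the race window.
        if (!taskManager.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return false;
        }

        // Only now may the remaining processes see the marker and proceed.
        Files.createFile(markerFile);
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a Unix-like system where `sleep` stands in for the TM1 process.
        Process tm1 = new ProcessBuilder("sleep", "60").start();
        Path marker = Files.createTempDirectory("recovery-test").resolve("proceed");

        boolean created = killAndCreateMarker(tm1, marker, 10);
        System.out.println("TM1 alive: " + tm1.isAlive() + ", marker created: " + created);
    }
}
```

Waiting with a timeout instead of indefinitely keeps a hanging kill from blocking the test forever; the caller can fail the test when the helper returns `false`.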

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
