zentol commented on issue #8412: [FLINK-12111][tests] Harden 
AbstractTaskManagerProcessFailureRecoveryTest
URL: https://github.com/apache/flink/pull/8412#issuecomment-492525060
 
 
   It weakens the test in some regards (as these kinds of weird timing issues 
are no longer covered) but strengthens it in other areas (BATCH mode is now 
_actually_ tested).
   
   I tried to reproduce the scenario you described by upping the heartbeat 
timeout, so that the network stack always fails first. This, however, didn't 
work: the restart was delayed because all tasks on the TM that timed out were 
stuck in the CANCELING state. Only once the heartbeat actually timed out could 
the restart proceed. As such, I'm no longer sure whether this scenario can 
actually occur.
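
The observation above can be illustrated with a toy timing model. This is only a sketch of the reasoning, not Flink code: the class, the method, and all parameter names are hypothetical, and it assumes that the restart can start no earlier than the moment the heartbeat of the lost TM times out (because its tasks stay in CANCELING until then).

```java
// Toy model only: none of these names exist in Flink.
final class RestartTimingSketch {

    /**
     * Earliest time (in ms after the TM process is killed) at which the job
     * could restart, under the assumption that tasks on the lost TM remain
     * in CANCELING until its heartbeat times out.
     */
    static long earliestRestartMillis(
            long networkFailureDetectedMillis,
            long restartDelayMillis,
            long heartbeatTimeoutMillis) {
        // The restart is requested once the network failure has been
        // detected and the configured restart delay has elapsed.
        long restartRequested = networkFailureDetectedMillis + restartDelayMillis;
        // It cannot proceed before the heartbeat timeout releases the
        // tasks that are stuck in CANCELING.
        return Math.max(restartRequested, heartbeatTimeoutMillis);
    }
}
```

Under this model, raising the heartbeat timeout only postpones the restart instead of letting it run ahead of the timeout, whereas a restart delay longer than the heartbeat timeout makes the delay the dominating term.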
   
   Upping the restart delay should work; I will implement that right away.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
