pltbkd commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-874719844


   Till and I discussed this topic briefly on the mailing list, and I'd like to 
share a point I raised there.
   
   > I was just considering that, since the environment is shared by JM and 
TMs, and the connections among TMs (using netty) are flaky in unstable 
environments, which will also cause the job failure, is it necessary to build a 
strongly guaranteed connection between JM and TMs, or it could be as flaky as 
the connections among TMs? As far as I know, connections among TMs will just 
fail on their first connection loss, so behaving like this in JM just means "as 
flaky as connections among TMs". 
   
   And here's Till's reply on the mailing list.
   > One simple approach could be to make the number of failed heartbeat RPCs 
until a target is marked as unreachable configurable because what represents a 
good enough criterion in one user's environment might produce too many 
false-positives in somebody else's environment. Or even simpler, one could say 
that one can disable reacting to a failed heartbeat RPC as it is currently the 
case.
   
   I think it's fine to introduce a new option that determines when the JM can 
conclude that a connection loss really means a TM loss. The option is in fact 
part of the heartbeat configuration rather than the akka configuration, and 
heartbeat options are all "expert" options, as indicated by the documentation 
annotation. I suppose the option would not be confusing for experts. And of 
course we need to provide a sensible default value for normal users.
   
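   If we use the number of failed heartbeat RPCs as the new option, we may need 
another interval option as well. The current heartbeat interval is too long for 
that purpose, since waiting for several failed RPCs at that interval would 
delay detection considerably.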
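
   To make the "number of failed heartbeat RPCs" idea concrete, here is a 
minimal sketch. The class name `FailedHeartbeatRpcCounter` and its methods are 
hypothetical and not part of Flink's API. It assumes that a successful RPC 
resets the count and that a non-positive threshold disables the reaction, 
which would keep the current behavior.

```java
/**
 * Hypothetical helper (not Flink API): counts consecutive failed heartbeat
 * RPCs and reports when a configurable threshold has been reached.
 */
final class FailedHeartbeatRpcCounter {

    // Assumption: a threshold <= 0 disables reacting to failed RPCs.
    private final int threshold;
    private int consecutiveFailures;

    FailedHeartbeatRpcCounter(int threshold) {
        this.threshold = threshold;
    }

    /** Records a failed RPC; returns true if the target should be marked unreachable. */
    synchronized boolean reportFailure() {
        if (threshold <= 0) {
            return false; // reaction disabled
        }
        consecutiveFailures++;
        return consecutiveFailures >= threshold;
    }

    /** Records a successful RPC and resets the failure count. */
    synchronized void reportSuccess() {
        consecutiveFailures = 0;
    }
}
```

   With a threshold of 1 the JM would react to the first failed RPC, which 
is "as flaky as connections among TMs". Larger values trade detection speed 
for fewer false positives.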
   
   Or maybe we can use a separate timeout option: the JM keeps retrying (with a 
short, non-configurable delay) to connect to the TM until the timeout expires. 
The main difference from the current heartbeat timeout is that the current 
timeout is a liveness probe, used while we assume the TM is alive, whereas the 
new timeout is a death probe, used when we assume the TM is already dead and 
already have some evidence of this. It can therefore be shorter than the 
heartbeat timeout, which means faster failure detection.
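
   The retry-until-timeout idea could look roughly like the sketch below. 
`DeathProbe`, `reachableWithin`, and the `tryConnect` callback are hypothetical 
names used only for illustration. The real connection attempt would go through 
Flink's RPC layer, which is not modeled here, and a blocking loop is used only 
to keep the example short.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch (not Flink API) of a "death probe" with a fixed retry delay. */
final class DeathProbe {

    /**
     * Retries tryConnect until it succeeds or deathTimeout elapses.
     * Returns true if the target became reachable, false if it should be
     * considered dead (or if the calling thread was interrupted).
     */
    static boolean reachableWithin(
            BooleanSupplier tryConnect, Duration deathTimeout, Duration retryDelay) {
        final long deadline = System.nanoTime() + deathTimeout.toNanos();
        while (true) {
            if (tryConnect.getAsBoolean()) {
                return true; // connection re-established, TM is alive
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timeout reached, treat the TM as lost
            }
            // Sleep for the retry delay, but never past the deadline.
            long sleepMillis =
                    Math.min(retryDelay.toMillis(), Math.max(1L, remainingNanos / 1_000_000L));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

   The probe would only start after a connection loss has been observed, so 
`deathTimeout` can be much shorter than the regular heartbeat timeout.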


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


