pltbkd commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-879546895


   I prefer to keep the mechanism simple and easy to understand, though I agree 
that expert configuration options may be needed in specific scenarios. For the 
first version, a simple mechanism is enough for most scenarios, and the current 
behavior can be kept as the default or as an alternative.
   
   >I guess it would have to be something like "If all (but at least 
LOWER_THRESHOLD) messages in Y time could not be delivered then the target is 
considered unreachable."
   
   I think this is a simple and clear rule, provided LOWER_THRESHOLD has a 
sensible default value that users will rarely need to modify. The time window 
is the main concern and depends more on the environment, which Liu also agreed 
with on the mailing list.
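   
   To make the quoted rule concrete, here is a minimal sketch in Java. It is 
only an illustration under stated assumptions, not the implementation proposed 
in this PR: the class name `UnreachabilityTracker`, its methods, and the 
sliding-window bookkeeping are all hypothetical, and `windowMillis` stands in 
for the "Y time" from the quote.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Flink: applies the rule
// "if all (but at least LOWER_THRESHOLD) messages in Y time could not be
// delivered, the target is considered unreachable".
final class UnreachabilityTracker {

    // One delivery attempt: when it happened and whether it succeeded.
    private static final class Outcome {
        final long timestampMillis;
        final boolean delivered;

        Outcome(long timestampMillis, boolean delivered) {
            this.timestampMillis = timestampMillis;
            this.delivered = delivered;
        }
    }

    private final int lowerThreshold; // LOWER_THRESHOLD in the quoted rule
    private final long windowMillis;  // "Y time" in the quoted rule
    private final Deque<Outcome> outcomes = new ArrayDeque<>();

    UnreachabilityTracker(int lowerThreshold, long windowMillis) {
        this.lowerThreshold = lowerThreshold;
        this.windowMillis = windowMillis;
    }

    // Records the outcome of one message sent at nowMillis.
    void record(long nowMillis, boolean delivered) {
        outcomes.addLast(new Outcome(nowMillis, delivered));
        evictExpired(nowMillis);
    }

    // True only if the window holds at least lowerThreshold attempts
    // and none of them was delivered.
    boolean isUnreachable(long nowMillis) {
        evictExpired(nowMillis);
        if (outcomes.size() < lowerThreshold) {
            return false;
        }
        for (Outcome outcome : outcomes) {
            if (outcome.delivered) {
                return false;
            }
        }
        return true;
    }

    // Drops attempts that are older than the time window.
    private void evictExpired(long nowMillis) {
        while (!outcomes.isEmpty()
                && nowMillis - outcomes.peekFirst().timestampMillis > windowMillis) {
            outcomes.removeFirst();
        }
    }
}
```

   With `lowerThreshold = 3`, two failed messages alone never mark the target 
as unreachable; the threshold guards against deciding on too few samples, while 
the window length is the value that has to be tuned per environment.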
   
   >That only applies if there are actual data connections between task 
executors; it does not apply to jobs without shuffles, jobs that are currently 
being deployed (potentially crashing the TM in the process), nor cases where 
the JM/all TMs go down.
   
   It does increase instability if we make the connections between JM and TM 
as weak as those between TMs, especially when no connections exist between TMs. 
But I suppose it's acceptable for different types of jobs in the same cluster 
to behave the same way, whether or not data connections exist.
   Moreover, in a job without data connections, tasks on other TMs are not 
blocked when one TM is lost, and the failover cost is lower. So detecting a 
lost TM quickly is less important for such jobs than for jobs with data 
connections. Maybe we don't need to worry much about them when building the new 
mechanism, since the current one is probably sufficient.
   
   
   By the way, similar to the k8s liveness probe, it may be better to respond 
to heartbeat requests as soon as possible rather than queue them with other RPC 
requests, which I would rather use as a readiness probe. That way, the response 
time for heartbeat requests becomes more predictable, so a timeout is a more 
reliable signal and the timeout value can possibly be decreased. Connection 
loss, though, would still not be a reliable signal. 
   But since I'm not experienced with akka, I can't tell whether this is 
possible or easy to implement.
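
   The idea of answering heartbeats ahead of queued RPC requests can be 
sketched as a mailbox with two lanes. This is a plain-Java illustration only: 
`TwoLaneMailbox` and its methods are hypothetical names, it does not use akka, 
and it says nothing about whether the same can be done in Flink's RPC layer.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical mailbox, not an akka or Flink class: heartbeat requests
// get their own lane and are always served before regular RPC requests.
final class TwoLaneMailbox {

    private final Queue<String> heartbeats = new ArrayDeque<>();
    private final Queue<String> regular = new ArrayDeque<>();

    // Adds a message to the heartbeat lane or to the regular lane.
    synchronized void enqueue(String message, boolean isHeartbeat) {
        if (isHeartbeat) {
            heartbeats.add(message);
        } else {
            regular.add(message);
        }
    }

    // Returns the next message to process, preferring heartbeats;
    // returns null when both lanes are empty.
    synchronized String poll() {
        String heartbeat = heartbeats.poll();
        return heartbeat != null ? heartbeat : regular.poll();
    }
}
```

   In this sketch a heartbeat that arrives after two pending RPC requests is 
still handled first, so its response time no longer depends on how long the 
regular lane is.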



