pltbkd commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-879546895
I prefer to keep the mechanism simple and easy to understand, though I agree that some expert configurations may be needed in specific scenarios. For the first version, a simple mechanism is enough for most scenarios, and the current behavior can be preserved as the default or as an alternative.

> I guess it would have to be something like "If all (but at least LOWER_THRESHOLD) messages in Y time could not be delivered then the target is considered unreachable."

I think this is a simple and clear rule, given a sensible default for LOWER_THRESHOLD that users will probably not need to modify. Time is the main concern and depends more on the environment, which Liu also agreed with on the mailing list.

> That only applies if there are actual data connections between task executors; it does not apply to jobs without shuffles, jobs that are currently being deployed (potentially crashing the TM in the process), nor cases where the JM/all TMs go down.

Making the connection between JM and TM as weak as the one between TMs does increase instability, especially when no connections exist between TMs. But I think it is acceptable for different types of jobs in a given cluster to get the same experience, whether or not data connections exist. Moreover, in a job without data connections, tasks on other TMs are not blocked when one TM is lost, and the failover cost is lower. So detecting a lost TM quickly is less necessary for such jobs than for jobs with data connections. Maybe we don't need to worry much about such jobs when we build the new mechanism, since the current one is probably enough.

By the way, similar to the k8s liveness probe, it may be better to respond to heartbeat requests as soon as possible rather than queue them behind other RPC requests, which I would rather treat as a readiness probe. That way, the time taken to respond to a heartbeat request becomes more predictable, so a timeout is a more reliable signal and the timeout value can possibly be decreased.
Though the connection loss is still not a reliable signal. But since I'm not experienced in akka, I can't tell if it's possible or easy to implement.
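
To make the quoted rule concrete, here is a minimal sketch of how "all (but at least LOWER_THRESHOLD) messages in Y time could not be delivered" might be evaluated. The class `UnreachabilityDetector` and its methods are hypothetical names for illustration only and are not part of Flink or of this PR. It assumes the caller reports every delivery attempt with a timestamp.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the quoted rule; not Flink's heartbeat API. */
final class UnreachabilityDetector {
    private final long windowMillis;   // "Y time" in the quoted rule
    private final int lowerThreshold;  // LOWER_THRESHOLD in the quoted rule
    // Each entry is {timestampMillis, delivered ? 1 : 0}, oldest first.
    private final Deque<long[]> attempts = new ArrayDeque<>();

    UnreachabilityDetector(Duration window, int lowerThreshold) {
        this.windowMillis = window.toMillis();
        this.lowerThreshold = lowerThreshold;
    }

    /** Records one delivery attempt made at the given time. */
    void record(long nowMillis, boolean delivered) {
        attempts.addLast(new long[] {nowMillis, delivered ? 1L : 0L});
        evictOlderThanWindow(nowMillis);
    }

    /** True if the window holds at least lowerThreshold attempts and all of them failed. */
    boolean isUnreachable(long nowMillis) {
        evictOlderThanWindow(nowMillis);
        if (attempts.size() < lowerThreshold) {
            return false; // too few samples to judge
        }
        for (long[] attempt : attempts) {
            if (attempt[1] == 1L) {
                return false; // at least one message got through
            }
        }
        return true;
    }

    private void evictOlderThanWindow(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!attempts.isEmpty() && attempts.peekFirst()[0] < cutoff) {
            attempts.removeFirst();
        }
    }
}
```

With this shape, the window length is the only value most users would need to tune, while `lowerThreshold` keeps a single lost message from marking the target as unreachable.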
