craigcondit commented on code in PR #404:
URL: https://github.com/apache/yunikorn-k8shim/pull/404#discussion_r845300003
##########
pkg/shim/scheduler.go:
##########
@@ -213,7 +213,7 @@ func (ss *KubernetesShim) recoverSchedulerState(e
*fsm.Event) {
recoverableAppManagers =
append(recoverableAppManagers, m)
}
}
- if err := ss.context.WaitForRecovery(recoverableAppManagers,
30*time.Second); err != nil {
+ if err := ss.context.WaitForRecovery(recoverableAppManagers,
5*time.Minute); err != nil {
Review Comment:
I also debated what to use here for a timeout. I went through the code paths
that the recover event goes through, and it seems like the only error that can
occur is if the RM is not properly registered with the shim (which should never
happen). If it were to happen, it would fail regardless of how long a timeout
we provide.
My reasoning for a longer timeout is that while unlikely, it's not
impossible that we fail to finish recovery with a large enough cluster. Also,
we log every second how many nodes are waiting on recovery, so if recovery
fails we should be able to see this in the logs pretty clearly.
Personally, I'd rather not have a timeout at all and just do the recovery
synchronously, but that's a more invasive change at this point...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]