craigcondit commented on code in PR #404:
URL: https://github.com/apache/yunikorn-k8shim/pull/404#discussion_r845300003


##########
pkg/shim/scheduler.go:
##########
@@ -213,7 +213,7 @@ func (ss *KubernetesShim) recoverSchedulerState(e 
*fsm.Event) {
                                recoverableAppManagers = 
append(recoverableAppManagers, m)
                        }
                }
-               if err := ss.context.WaitForRecovery(recoverableAppManagers, 
30*time.Second); err != nil {
+               if err := ss.context.WaitForRecovery(recoverableAppManagers, 
5*time.Minute); err != nil {

Review Comment:
   I also debated what to use here for a timeout. I went through the code paths 
that the recover event goes through, and it seems like the only error that can 
occur is if the RM is not properly registered with the shim (which should never 
happen). If it were to happen, it would fail regardless of how long a timeout 
we provide.
   
   My reasoning for a longer timeout is that while unlikely, it's not 
impossible that we fail to finish recovery with a large enough cluster. Also, 
we log every second how many nodes are waiting on recovery, so if recovery 
fails we should be able to see this in the logs pretty clearly.
   
   Personally, I'd rather not have a timeout at all and just do the recovery 
synchronously, but that's a more invasive change at this point...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to