Craig Condit created YUNIKORN-1670:
--------------------------------------
Summary: Application recovery can fail if app is rejected
Key: YUNIKORN-1670
URL: https://issues.apache.org/jira/browse/YUNIKORN-1670
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Craig Condit
Assignee: Craig Condit
During application recovery, the current code waits up to 30 seconds for all
applications to transition to "Accepted". However, a rejected application
never reaches that state, and on a large enough cluster the transitions may
not all complete within 30 seconds. In either case, recovery fails.
Similar to how informer sync was recently updated, we should modify the logic
to keep trying instead of timing out, and log progress periodically.
Additionally, we should not wait specifically for the Accepted state, but for
any state other than New or Recovering. This ensures that we have processed
all the applications, including rejected ones.