Weiwei Yang created YUNIKORN-549:
------------------------------------
Summary: Scheduler recovery failure occasionally while recovering
a large number of applications
Key: YUNIKORN-549
URL: https://issues.apache.org/jira/browse/YUNIKORN-549
Project: Apache YuniKorn
Issue Type: Improvement
Components: shim - kubernetes
Reporter: Weiwei Yang
Assignee: Weiwei Yang
The current recovery logic adds applications back based on the pods reported by
the informers/listers. In some conditions, the recovery of an app can fail if
the app has both Running and Pending pods. This is because the shim marks an
app for recovery if the informer notified the scheduler before the lister
function is called. This does not work consistently, so we need a stable
implementation to tell whether an app needs recovery or not.
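
One possible direction, sketched below in Go, is to decide recovery from the
listed pod state alone, so the result does not depend on whether the informer
or the lister ran first. This is only an illustration under stated assumptions,
not the fix adopted for this issue: the Pod type, its fields, and
appsNeedingRecovery are hypothetical names, and "already assigned to a node" is
an assumed criterion for "needs recovery".

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
// It is not the real k8s.io/api type.
type Pod struct {
	Name     string
	AppID    string // application the pod belongs to (assumed to be known)
	NodeName string // empty while the pod has not been assigned to a node
}

// appsNeedingRecovery returns the IDs of applications that own at least one
// pod already assigned to a node. The decision uses only the listed pods,
// so it does not depend on the order of informer and lister calls.
func appsNeedingRecovery(pods []Pod) map[string]bool {
	result := make(map[string]bool)
	for _, p := range pods {
		if p.AppID != "" && p.NodeName != "" {
			result[p.AppID] = true
		}
	}
	return result
}

func main() {
	pods := []Pod{
		{Name: "app-1-driver", AppID: "app-1", NodeName: "node-a"}, // assigned
		{Name: "app-1-exec-1", AppID: "app-1"},                     // pending
		{Name: "app-2-driver", AppID: "app-2"},                     // pending
	}
	recovering := appsNeedingRecovery(pods)
	// app-1 mixes assigned and pending pods and is still marked for recovery;
	// app-2 has only pending pods and is treated as a new application.
	fmt.Println(recovering["app-1"], recovering["app-2"]) // prints "true false"
}
```

With this approach an app that has both Running and Pending pods is always
classified the same way, whatever the event ordering.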
--
This message was sent by Atlassian Jira
(v8.3.4#803005)