Peter Bacsko created YUNIKORN-1597:
--------------------------------------
Summary: Gang scheduling: application might not transition to Running after recovery
Key: YUNIKORN-1597
URL: https://issues.apache.org/jira/browse/YUNIKORN-1597
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko
Pods get stuck in a certain recovery scenario which involves gang scheduling.
High level overview:
1. All placeholders are running and allocated
2. The real pod is in Pending state
3. YuniKorn crashes and recovers
In this case, the real pod will not transition to Running, for two reasons:
1. Upon recovery, the state of recovered tasks will be set to "Allocated", not
"Bound".
2. If placeholder tasks are already running and allocated, there will be no
call to {{postTaskBound()}}.
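
The two points above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the shim's actual code: the App type, the bindTask and recoverApp helpers, and the rule that the application becomes Running once postTaskBound() sees every placeholder in "Bound" are all hypothetical simplifications. Only the state names and the hook name come from this report.

```go
package main

import "fmt"

// TaskState is a simplified, hypothetical stand-in for the shim's task states.
type TaskState string

const (
	Allocated TaskState = "Allocated"
	Bound     TaskState = "Bound"
)

// App is a toy model of a gang-scheduled application (not the real type).
type App struct {
	Placeholders map[string]TaskState
	Running      bool
}

// postTaskBound mimics the hook named in the report. Assumption: it is the
// only place where the application is moved to Running, and it does so once
// every placeholder task is Bound.
func (a *App) postTaskBound() {
	for _, state := range a.Placeholders {
		if state != Bound {
			return
		}
	}
	a.Running = true
}

// bindTask models the normal path: a bind event marks the task Bound and
// then triggers the hook.
func (a *App) bindTask(name string) {
	a.Placeholders[name] = Bound
	a.postTaskBound()
}

// recoverApp models recovery as described in the report: tasks come back as
// Allocated, and since the pods are already running, no bind event follows.
func recoverApp(names []string) *App {
	app := &App{Placeholders: make(map[string]TaskState)}
	for _, n := range names {
		app.Placeholders[n] = Allocated
	}
	return app
}

func main() {
	names := []string{"ph-1", "ph-2"}

	// Normal path: every placeholder receives a bind event.
	normal := recoverApp(names)
	for _, n := range names {
		normal.bindTask(n)
	}
	fmt.Println("normal path running:", normal.Running)

	// Recovery path: no bind events arrive, so the hook never fires.
	recovered := recoverApp(names)
	fmt.Println("recovered running:", recovered.Running)
}
```

In this model the normal path ends with Running set to true, while the recovered application stays at false because nothing ever calls the hook. That matches the stuck state described above.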