Wilfred Spiegelenburg created YUNIKORN-1900:
-----------------------------------------------
Summary: Orphan allocation due to placeholder deletes
Key: YUNIKORN-1900
URL: https://issues.apache.org/jira/browse/YUNIKORN-1900
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Gang scheduled applications can leave orphaned allocations. This can happen
when the gang scheduling setup specifies only one taskgroup with one member
for the app.
This setup by itself is not a problem and works. The issue is triggered when
the placeholder is replaced with the real allocation. The replacement
temporarily removes all allocations and, with only 1 gang member, leaves no
pending asks. That is the trigger for the state change of the application to
COMPLETING. This is the correct state change for the app if nothing is left:
no allocations and no asks.
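The condition described above can be illustrated with a minimal sketch. The
type, field and method names are hypothetical and chosen for illustration
only; this is not the YuniKorn core code.

```go
package main

import "fmt"

// app is a hypothetical, simplified stand-in for a scheduler application.
// It is not the YuniKorn core type.
type app struct {
	state       string
	allocations int // allocations on the app, placeholders included
	pendingAsks int // asks still waiting for an allocation
}

// removeAllocation drops one allocation and moves the app to COMPLETING
// when neither allocations nor pending asks remain.
func (a *app) removeAllocation() {
	if a.allocations > 0 {
		a.allocations--
	}
	if a.allocations == 0 && a.pendingAsks == 0 {
		a.state = "COMPLETING"
	}
}

func main() {
	// One taskgroup with one member: a single placeholder, no pending asks.
	a := &app{state: "RUNNING", allocations: 1}
	a.removeAllocation() // placeholder released as part of the replacement
	fmt.Println(a.state)
}
```

With more than one gang member, or with a pending ask, the same removal would
leave the app in its current state in this model.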
The point at which the state change is triggered is however a problem. If the
allocation of the driver is not a replacement, the COMPLETING application
moves to RUNNING via a state update. We trigger a state change in that case
and the issue does not occur. For placeholder replacements we trigger the
state change, if needed, on the removal of the placeholder, not when the real
allocation is confirmed.
If the confirmation is processed before the COMPLETING state times out, the
allocation is added to the node and never cleaned up. When the COMPLETING
state times out, the application gets removed without the cleanup of the
allocation. The allocation cleanup does not get triggered because the
COMPLETING state should never be entered with allocations on the app.
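The sequence described above can be replayed in a small, self-contained
model. Everything here (the scheduler, node and app types, the method names
and the IDs) is hypothetical and only mirrors the order of events in this
report; it is not the YuniKorn core implementation. The non-replacement run
reuses the same steps purely to show the contrast.

```go
package main

import "fmt"

// node tracks allocations placed on it: allocation ID -> application ID.
type node struct {
	allocs map[string]string
}

type app struct {
	id     string
	state  string
	allocs map[string]bool
}

type scheduler struct {
	node *node
	apps map[string]*app
}

// removePlaceholder releases a placeholder. An app with nothing left moves
// to COMPLETING.
func (s *scheduler) removePlaceholder(a *app, allocID string) {
	delete(a.allocs, allocID)
	delete(s.node.allocs, allocID)
	if len(a.allocs) == 0 {
		a.state = "COMPLETING"
	}
}

// confirmAllocation adds the allocation to the app and the node. Only the
// non-replacement path triggers the state change back to RUNNING.
func (s *scheduler) confirmAllocation(a *app, allocID string, replacement bool) {
	a.allocs[allocID] = true
	s.node.allocs[allocID] = a.id
	if !replacement && a.state == "COMPLETING" {
		a.state = "RUNNING"
	}
}

// completingTimeout removes a COMPLETING app without touching allocations,
// because that state assumes there are none.
func (s *scheduler) completingTimeout(a *app) {
	if a.state == "COMPLETING" {
		delete(s.apps, a.id)
	}
}

// orphans lists allocations on the node whose application no longer exists.
func (s *scheduler) orphans() []string {
	var out []string
	for allocID, appID := range s.node.allocs {
		if _, ok := s.apps[appID]; !ok {
			out = append(out, allocID)
		}
	}
	return out
}

// run replays: placeholder removal, confirmation, COMPLETING timeout.
func run(replacement bool) []string {
	a := &app{id: "app-1", state: "RUNNING", allocs: map[string]bool{"placeholder-1": true}}
	s := &scheduler{
		node: &node{allocs: map[string]string{"placeholder-1": "app-1"}},
		apps: map[string]*app{"app-1": a},
	}
	s.removePlaceholder(a, "placeholder-1")
	s.confirmAllocation(a, "real-1", replacement)
	s.completingTimeout(a)
	return s.orphans()
}

func main() {
	fmt.Println("replacement:", run(true))
	fmt.Println("non-replacement:", run(false))
}
```

In this model the replacement run leaves "real-1" on the node after the app
is gone, while the non-replacement run leaves nothing orphaned because the
app is back in RUNNING before the timeout fires.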
--
This message was sent by Atlassian Jira
(v8.20.10#820010)