Wilfred Spiegelenburg created YUNIKORN-1900:
-----------------------------------------------

             Summary: Orphan allocation due to placeholder deletes
                 Key: YUNIKORN-1900
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1900
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Wilfred Spiegelenburg
            Assignee: Wilfred Spiegelenburg


Gang-scheduled applications can leave orphaned allocations. This can happen
when the gang scheduling setup specifies only one task group with one member
for the app.
This setup by itself is not a problem and works. Replacing the placeholder
with the real allocation triggers the issue. The replacement temporarily
removes all allocations and, with only one gang member, leaves no pending asks.
That triggers the state change of the application to COMPLETING. This is the
correct state change for the app if nothing is left: no allocations and no
asks.

Triggering the state change is, however, a problem. If the allocation of the
driver is not a replacement, the COMPLETING application moves to RUNNING via a
state update: we trigger a state change in that case and the issue does not
occur. For placeholder replacements we trigger the state change, if needed,
on the removal of the placeholder, not when the real allocation is confirmed.

If the confirmation is processed before the COMPLETING state times out, the
allocation is added to the node and never cleaned up. When the COMPLETING state
times out, the application is removed without cleaning up the allocation.

The allocation cleanup does not get triggered as the COMPLETING state should 
never be entered with allocations on the app.
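
The sequence above can be sketched as a small toy model. This is only an
illustration of the ordering problem, not YuniKorn code: the types App and
Node, the methods on them and the "recheck" flag are all hypothetical names
made up for this example. The flag models one possible remedy (re-evaluating
the state when the real allocation is confirmed), which is an assumption and
not necessarily the fix that will be implemented.

```go
package main

import "fmt"

// State is a simplified, hypothetical application state.
type State string

const (
	Running    State = "RUNNING"
	Completing State = "COMPLETING"
	Completed  State = "COMPLETED"
)

// Node is a toy node that tracks the allocation IDs placed on it.
type Node struct {
	allocs map[string]bool
}

// App is a toy application; it is not the YuniKorn application object.
type App struct {
	state       State
	allocs      map[string]bool
	pendingAsks int
}

// updateState moves the app to COMPLETING when no allocations or asks are
// left, and back to RUNNING when a COMPLETING app has work again.
func (a *App) updateState() {
	if len(a.allocs) == 0 && a.pendingAsks == 0 {
		a.state = Completing
	} else if a.state == Completing {
		a.state = Running
	}
}

// removePlaceholder drops the placeholder and re-evaluates the state,
// which is where the state change is triggered for replacements.
func (a *App) removePlaceholder(n *Node, id string) {
	delete(a.allocs, id)
	delete(n.allocs, id)
	a.updateState()
}

// confirmReplacement adds the real allocation to the app and the node.
// With recheck == false the state is left untouched, as described above.
func (a *App) confirmReplacement(n *Node, id string, recheck bool) {
	a.allocs[id] = true
	n.allocs[id] = true
	if recheck {
		a.updateState()
	}
}

// completingTimeout removes the app if it is still COMPLETING. It does no
// allocation cleanup because COMPLETING is assumed to have no allocations.
func (a *App) completingTimeout() bool {
	if a.state != Completing {
		return false
	}
	a.state = Completed
	return true
}

// simulate runs the replacement sequence for a single-member gang and
// returns the number of allocations left on the node after app removal.
func simulate(recheck bool) int {
	n := &Node{allocs: map[string]bool{"placeholder-1": true}}
	a := &App{state: Running, allocs: map[string]bool{"placeholder-1": true}}
	a.removePlaceholder(n, "placeholder-1")
	a.confirmReplacement(n, "real-1", recheck)
	if a.completingTimeout() {
		return len(n.allocs) // anything left here is orphaned
	}
	return 0 // app is still alive, nothing is orphaned
}

func main() {
	fmt.Println("orphans without state recheck:", simulate(false))
	fmt.Println("orphans with state recheck:", simulate(true))
}
```

In this model the run without the recheck leaves the real allocation on the
node after the app is removed, while the run with the recheck moves the app
back to RUNNING so the timeout never removes it.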


