[
https://issues.apache.org/jira/browse/YUNIKORN-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767003#comment-17767003
]
Wilfred Spiegelenburg commented on YUNIKORN-1993:
-------------------------------------------------
Further detail on what happened based on the log and state dump analysis:
The starting point was a Spark application that defined one driver placeholder
and one executor placeholder.
Both placeholders got allocated. The real driver replaced its placeholder and
ran. An executor pod was never created.
Now it becomes a little complex:
* placeholder timeout is triggered and removal of the unused placeholder is
started
* driver exits normally and its allocation gets removed
* driver allocation removal triggers a change to the Completing state as it is
the only _real_ allocation of the application
* ... time passes
* placeholder removal confirmation comes in from the shim and removal finishes
and releases the application lock
* Completing state times out and gets the application lock
* no allocations or placeholders are left, which triggers a transition to the
Completed state
* on entry of the Completed state the application queue link is set to nil
* placeholder removal has progressed to the point where it updates the queue
allocations
When the placeholder removal tries to retrieve the queue it gets a nil. The
queue is not set on the application anymore and does not get updated…
We leak resources on the queue in the allocated accounting.
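
The interleaving above can be replayed in a minimal Go sketch. This is not the
YuniKorn code: the Queue and Application types, their fields and all method
names are hypothetical stand-ins, and the steps run sequentially in the
problematic order instead of in separate goroutines so that the outcome is
deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a hypothetical stand-in for a scheduler queue that tracks
// how much has been allocated to the applications running in it.
type Queue struct {
	sync.Mutex
	allocated int
}

// decAllocated removes n from the allocated accounting of the queue.
func (q *Queue) decAllocated(n int) {
	q.Lock()
	defer q.Unlock()
	q.allocated -= n
}

// Application is a hypothetical stand-in for an application that holds a
// link to its queue and a count of outstanding placeholders.
type Application struct {
	sync.Mutex
	queue        *Queue
	placeholders int
}

// removePlaceholder drops one placeholder from the application and returns
// the amount that still has to be released on the queue. The application
// lock is released before the queue is updated.
func (a *Application) removePlaceholder() int {
	a.Lock()
	defer a.Unlock()
	a.placeholders--
	return 1
}

// completingTimeout models the Completing state timeout: with nothing left
// the application moves to Completed and the queue link is set to nil.
func (a *Application) completingTimeout() {
	a.Lock()
	defer a.Unlock()
	if a.placeholders == 0 {
		a.queue = nil
	}
}

// updateQueueAccounting is the last step of the placeholder removal. It
// reports false when the queue link is already gone and nothing is updated.
func (a *Application) updateQueueAccounting(released int) bool {
	a.Lock()
	q := a.queue
	a.Unlock()
	if q == nil {
		return false
	}
	q.decAllocated(released)
	return true
}

// replayRace runs the three steps in the order described above and returns
// what is still tracked on the queue and whether the queue was updated.
func replayRace() (int, bool) {
	q := &Queue{allocated: 1}
	app := &Application{queue: q, placeholders: 1}

	released := app.removePlaceholder() // removal releases the application lock
	app.completingTimeout()             // timeout wins the lock, queue link set to nil
	updated := app.updateQueueAccounting(released)
	return q.allocated, updated
}

func main() {
	leaked, updated := replayRace()
	fmt.Printf("queue updated: %v, still tracked on queue: %d\n", updated, leaked)
}
```

In this sketch the queue keeps an allocated value of 1 although the
application has nothing left, which is the kind of leak described above.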
> Race between allocation removal and Completed state change
> ----------------------------------------------------------
>
> Key: YUNIKORN-1993
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1993
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Critical
>
> A race between goroutines exists that can leave an allocation tracked on a
> queue. The end result could show a queue that has an allocation without any
> running applications in the queue.
> Worst case scenario would be an exhausted root queue quota causing all
> scheduling to stop.