[ https://issues.apache.org/jira/browse/YUNIKORN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790013#comment-17790013 ]
Wilfred Spiegelenburg commented on YUNIKORN-2171:
-------------------------------------------------
To clarify: this is more likely to happen if the headroom in the root queue is
less than the size of the node that is being removed. If the headroom is
larger than the node, the allocation needs to be large enough to bridge that
difference. The combination of allocation size, node usage, and leftover
headroom is then what matters.
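
The condition above can be written as a small check. This is only a sketch of
one reading of the comment, not YuniKorn code: it uses a single resource
dimension, and the function name raceCanFail and its parameters are
hypothetical. It assumes the allocation passed a headroom check taken before
the node was removed, and that the root max has since dropped by the node size
while the usage has not yet been corrected.

```go
package main

import "fmt"

// raceCanFail reports whether an allocation that passed a headroom check taken
// before the node removal would leave the root queue over its lowered max.
// All values are one hypothetical resource dimension (for example vcores).
func raceCanFail(staleHeadroom, nodeSize, allocSize int64) bool {
	if allocSize > staleHeadroom {
		// Rejected by the normal headroom check, so the race is not involved.
		return false
	}
	// After the removal the root max is lower by nodeSize while usage is
	// unchanged, so the headroom that really applies is staleHeadroom - nodeSize.
	return allocSize > staleHeadroom-nodeSize
}

func main() {
	// Headroom (4) smaller than the node (8): the root is already over max.
	fmt.Println(raceCanFail(4, 8, 1))
	// Headroom (10) larger than the node (8): an ask of 2 still fits.
	fmt.Println(raceCanFail(10, 8, 2))
	// Same headroom and node, but an ask of 3 bridges the difference of 2.
	fmt.Println(raceCanFail(10, 8, 3))
}
```

With the headroom below the node size, any allocation that the scheduler finds
hits the problem; otherwise only allocations larger than the difference do.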
> race between node removal and scheduling cycle
> ----------------------------------------------
>
> Key: YUNIKORN-2171
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2171
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Critical
> Labels: pull-request-available
>
> When a node gets removed, the partition resources, and thus the root queue max
> resources, are decreased. The node removal locks the partition, removes the
> node, and releases the partition lock before proceeding. Cleanup of the
> allocations happens after that. This means that for a short period of time
> the root queue max resources are already decreased while the usage is not.
> The scheduling cycle could be running during the node removal. The queue
> headroom calculation uses the queue max resources and usage to calculate the
> difference. The whole hierarchy is traversed for this.
> If the headroom is limited by the root queue, we could have a race
> between the removal of the node allocations and scheduling:
> * scheduling starts and queue headroom is calculated
> * node is removed, queue max is lowered
> * scheduling finds new allocation
> * new allocation gets added to the queue updating usage
> * root queue is over its max already or would go over max: scheduling fails
> * node allocation removal proceeds and corrects the queue usage
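
The interleaving listed above can be replayed deterministically in a few lines.
This is a minimal sketch under stated assumptions, not the YuniKorn
implementation: the queue type, tryAllocate, and replay are hypothetical names,
resources are a single number, and the steps run in a fixed order instead of on
separate goroutines. It also assumes the ask fits in the headroom that was
calculated before the removal.

```go
package main

import "fmt"

// queue is a hypothetical single-resource stand-in for the root queue.
type queue struct {
	max, used int64
}

// headroom is the difference between the queue max and its usage.
func (q *queue) headroom() int64 { return q.max - q.used }

// tryAllocate adds the usage only if the queue stays within its max.
func (q *queue) tryAllocate(size int64) bool {
	if q.used+size > q.max {
		return false
	}
	q.used += size
	return true
}

// replay walks through the steps of the race in order. It returns the headroom
// seen by the scheduler, whether the allocation was accepted inside the race
// window, and whether the same ask fits once the node cleanup has run.
func replay(rootMax, rootUsed, nodeSize, nodeUsed, ask int64) (int64, bool, bool) {
	root := &queue{max: rootMax, used: rootUsed}

	stale := root.headroom()          // scheduling starts, headroom is calculated
	root.max -= nodeSize              // node is removed, queue max is lowered
	accepted := root.tryAllocate(ask) // new allocation tries to update usage
	root.used -= nodeUsed             // node allocation removal corrects usage

	// If the ask was accepted it is already part of the usage.
	fits := accepted || root.used+ask <= root.max
	return stale, accepted, fits
}

func main() {
	// Root max 100 with 90 used; the removed node has size 20 and 15 in use.
	stale, accepted, fits := replay(100, 90, 20, 15, 5)
	fmt.Println("headroom seen by scheduler:", stale)
	fmt.Println("accepted during race window:", accepted)
	fmt.Println("fits after cleanup:", fits)
}
```

In this example the scheduler sees a headroom of 10, the ask of 5 is refused
because the root max has already dropped to 80 while usage is still 90, and the
same ask fits once the cleanup lowers the usage to 75.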