[ 
https://issues.apache.org/jira/browse/YUNIKORN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789882#comment-17789882
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2171:
-------------------------------------------------

To explain the issue in more detail, based on a question in the PR:
{quote}As a first step in decommissioning, Node has been marked schedulable as 
false. This happens well before all the allocations cleanup, root queue 
resource adjustments etc. Since we are doing the check is node schedulable in 
all try*() calls, just wondering how do we encounter this problem between root 
queue adjustments & allocations cleanup?
{quote}
The problem is not whether the node is marked unschedulable. The node 
resources are removed from the root queue max resources, and that happens 
before we remove the allocations from the application(s) and queue(s).
So assume the following: 
 * cluster with 2 nodes A and B: each with 32GB
 * the root queue max is set to 64GB.
 * both nodes have a usage of 30GB, root queue usage is thus 60GB
 * root queue has a headroom of 4GB (64GB - 60GB).
 * queues below the root do not have a max set (to simplify the example)

Now we schedule a 2GB allocation. It fits well inside the headroom of the 
root queue.
The allocation is assigned to node A; everything fits and looks OK. The issue 
starts when the allocation is added to the node and the queue.

While the scheduler is trying to find a node for the allocation, the removal 
of node B is requested and partially processed: the root queue max is set to 
32GB. The allocations are not yet removed, which means the root queue usage 
is still 60GB.
We now try to update the usage of the queues as part of the node A 
confirmation. At the root queue the usage _would_ become 60GB + 2GB == 62GB, 
well over the new 32GB max. This fails.

The max of the root queue was set to 32GB due to the node removal. _Any_ 
increment would thus put the root queue over its max: usage is still 
somewhere between 30GB and 60GB, so the root queue is already over its max, 
and any remaining allocation from the removed node keeps it there.
After this, scheduling will not happen until all allocations from the removed 
node are removed. The headroom of the root queue stays negative until the 
cleanup is done. It thus happens only once and fixes itself.
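The scenario above can be sketched in a few lines of Go. This is a minimal 
illustration only; the `queue` type and `tryIncrement` helper are hypothetical 
and are not the YuniKorn core API:

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a scheduler queue, tracking
// only max and current usage in GB.
type queue struct {
	max   int // GB
	usage int // GB
}

// tryIncrement models confirming an allocation against the queue:
// it fails if the new usage would exceed the queue max.
func (q *queue) tryIncrement(alloc int) error {
	if q.usage+alloc > q.max {
		return fmt.Errorf("usage %dGB + %dGB exceeds max %dGB",
			q.usage, alloc, q.max)
	}
	q.usage += alloc
	return nil
}

func main() {
	// Two 32GB nodes, 30GB used on each: max 64GB, usage 60GB.
	root := &queue{max: 64, usage: 60}
	fmt.Println("headroom:", root.max-root.usage) // 4GB: the 2GB allocation fits

	// Node B removal is partially processed: the root max drops to 32GB,
	// but node B's 30GB of allocations are still counted in usage.
	root.max = 32
	fmt.Println("headroom:", root.max-root.usage) // negative headroom

	// Confirming the 2GB allocation on node A now fails.
	if err := root.tryIncrement(2); err != nil {
		fmt.Println("allocation rejected:", err)
	}
}
```

Only after the 30GB of node B allocations are removed from usage does the 
headroom turn positive again and scheduling resume.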

We should prevent this from happening anyway.

> race between node removal and scheduling cycle
> ----------------------------------------------
>
>                 Key: YUNIKORN-2171
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2171
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>              Labels: pull-request-available
>
> When a node gets removed the partition resources and thus the root max 
> resources are decreased. The node removal locks the partition, removes the 
> node and releases the partition lock before proceeding. Cleanup of the 
> allocations happens after that. This means that for a short period of time 
> the root queue max resources are already decreased while the usage is not.
> The scheduling cycle could be running during the node removal. The queue 
> headroom calculation uses the queue max resources and usage to calculate the 
> difference. The whole hierarchy is traversed for this.
> If the headroom is limited by the root queue then we could have a race 
> between the removal of the node allocations and scheduling:
>  * scheduling starts and queue headroom is calculated
>  * node is removed, queue max is lowered
>  * scheduling finds new allocation
>  * new allocation gets added to the queue updating usage
>  * root queue is over its max already or would go over max: scheduling fails
>  * node allocation removal proceeds and corrects the queue usage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
