[ 
https://issues.apache.org/jira/browse/YUNIKORN-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R reassigned YUNIKORN-3092:
--------------------------------------

    Assignee: Manikandan R

> Reservations can permanently block nodes, leading to preemption failure and a 
> stuck scheduler state
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3092
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.6.3
>            Reporter: John Daciuk
>            Assignee: Manikandan R
>            Priority: Minor
>              Labels: preemption
>         Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h2. Context
> Since deploying YuniKorn back in October 2024 we've encountered occasional 
> preemption misses: a high priority pod sits pending for hours, we manually 
> delete a low priority pod, and the high priority pod then schedules. 
> We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2 
> helpful. In particular, [this 
> PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% expected 
> preemption in our testing thanks to its reservation removal logic. However, 
> we still find that 1.6.2 is not reliable with respect to preemption in 
> practice, and we can reproduce preemption misses by scaling our original 
> preemption load test up by 4x.
> h2. Repro
> With YuniKorn 1.6.3, schedule ~400 low priority pods that live forever and 
> fill all node capacity. Once they are running, schedule the same number of 
> high priority pods to a different queue. Use the same resource requests for 
> all pods.
> We expect all the high priority pods to eventually schedule. However, we 
> find about 10% of them stuck pending. This can be seen in the attached 
> screenshot, where the high priority pods are tier0.
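> For reference, the pods in our test look roughly like the sketch below. This 
> is a hand-wavy reconstruction, not our actual harness: the name, image, 
> resource sizes and priority class are placeholders, while the applicationId 
> and queue labels and schedulerName: yunikorn are the standard YuniKorn pod 
> wiring.
> {code:go}
> package loadtest
> 
> import (
> 	corev1 "k8s.io/api/core/v1"
> 	"k8s.io/apimachinery/pkg/api/resource"
> 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> )
> 
> // testPod builds one load-test pod. Low and high priority pods are identical
> // except for the queue, the applicationId and the priority class.
> func testPod(name, queue, appID, priorityClass string) *corev1.Pod {
> 	return &corev1.Pod{
> 		ObjectMeta: metav1.ObjectMeta{
> 			Name: name,
> 			Labels: map[string]string{
> 				"applicationId": appID, // YuniKorn application ID
> 				"queue":         queue, // YuniKorn queue name
> 			},
> 		},
> 		Spec: corev1.PodSpec{
> 			SchedulerName:     "yunikorn",    // hand the pod to YuniKorn
> 			PriorityClassName: priorityClass, // placeholder, e.g. "tier0"
> 			Containers: []corev1.Container{{
> 				Name:    "sleep",
> 				Image:   "busybox",
> 				Command: []string{"sleep", "1000000"}, // "lives forever"
> 				Resources: corev1.ResourceRequirements{
> 					Requests: corev1.ResourceList{
> 						corev1.ResourceCPU:    resource.MustParse("1"),
> 						corev1.ResourceMemory: resource.MustParse("1Gi"),
> 					},
> 				},
> 			}},
> 		},
> 	}
> }
> {code}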
> If we add logging like in [this diff from 
> branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6],
>  we see:
> {quote}
> 2025-06-23T04:54:26.776Z    INFO    core.scheduler.preemption    
> objects/preemption.go:93    Removing node ip-100-76-60-239.ec2.internal from 
> consideration. node.IsReserved: true, node reservations: 
> map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal -> 
> tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true 
>     {"applicationID": "tier0-1-406-328120", "allocationKey": 
> "e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"}
> {quote}
> A node with tons of potential victims, ip-100-76-60-239.ec2.internal, is 
> removed from consideration for preemption because it is reserved. Looking at 
> the reservation map above, we see that pod tier0-1-395 holds the reservation.
> So pod tier0-1-395 is blocking the entire node. Why can't it schedule and 
> release the reservation?
> {quote}
> 2025-06-23T04:43:45.942Z    INFO    core.scheduler.application    
> objects/application.go:1008    tryAllocate did not find a candidate 
> allocation in the node iterator, allowPreemption: true, 
> preemptAttemptsRemaining: 0    {"applicationID": "tier0-1-395-157140", 
> "author": "MLP"}
> {quote}
> Because there are no preemption attempts left for that queue this cycle. 
> Unfortunately, this situation repeats every cycle, since pod tier0-1-395 is 
> never among the first in the queue to tryAllocate.
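> To make the loop concrete, here is a toy sketch of the starvation pattern as 
> we understand it. These are made-up types, not the real yunikorn-core 
> structures; the point is only that apps ahead of the reserving app spend the 
> per-cycle preemption budget every cycle, so the reservation is never 
> resolved and never released.
> {code:go}
> package main
> 
> import "fmt"
> 
> // app is a stand-in for a scheduling request tried in queue order.
> type app struct {
> 	id          string
> 	hasReserved bool // holds a node reservation from an earlier cycle
> }
> 
> // runCycle models one scheduling cycle with a shared preemption budget
> // (the preemptAttemptsRemaining counter in the log above).
> func runCycle(apps []app, preemptAttemptsRemaining int) {
> 	for _, a := range apps {
> 		if a.hasReserved && preemptAttemptsRemaining == 0 {
> 			// tier0-1-395's situation: it could place its ask via preemption,
> 			// but the budget is spent, so the reservation survives the cycle
> 			// and keeps its node out of preemption for every other app.
> 			fmt.Printf("%s: no attempts left, node stays blocked\n", a.id)
> 			return
> 		}
> 		if preemptAttemptsRemaining > 0 {
> 			preemptAttemptsRemaining-- // earlier apps consume the budget
> 		}
> 	}
> }
> 
> func main() {
> 	apps := make([]app, 0, 400)
> 	for i := 0; i < 399; i++ {
> 		apps = append(apps, app{id: fmt.Sprintf("filler-%d", i)})
> 	}
> 	// The reserving app sits near the back of the queue, so each cycle the
> 	// budget is exhausted before it is ever tried.
> 	apps = append(apps, app{id: "tier0-1-395", hasReserved: true})
> 	runCycle(apps, 10) // budget of 10 is an arbitrary placeholder
> }
> {code}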
> h2. Thoughts
> This is one example, but there are a number of ways to get stuck in such a 
> cycle. For the preemption failure here in particular, it seems we need some 
> way to either remove the dead reservation or ignore it while considering 
> preemption victims.
> For example, when we iterate through the nodes in [this 
> code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
>  perhaps we could first try with reserved nodes filtered out (as the code 
> does today), then try another loop that ignores and/or breaks reservations 
> if we find victims. Would ignoring the reservation be enough, or do we have 
> to delete it for the preemption to then result in scheduling?
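> Here is a sketch of that two-pass idea, again with toy types standing in for 
> the real node iterator and victim calculation in objects/preemption.go, just 
> to make the shape of the proposal concrete:
> {code:go}
> package preemption
> 
> // node is a stand-in for the scheduler's node object.
> type node struct {
> 	name     string
> 	reserved bool
> }
> 
> // victimsOn is a placeholder for the real victim search: does this node
> // hold enough preemptable pods to fit the ask?
> func victimsOn(n *node) bool { return true }
> 
> // pickNode: pass 1 honors reservations exactly as the current code does;
> // pass 2 only runs if pass 1 found nothing, and considers reserved nodes on
> // the theory that a viable victim set justifies ignoring (or breaking) a
> // reservation that is itself stuck behind the preemption budget.
> func pickNode(nodes []*node) *node {
> 	for _, n := range nodes { // pass 1: skip reserved nodes entirely
> 		if !n.reserved && victimsOn(n) {
> 			return n
> 		}
> 	}
> 	for _, n := range nodes { // pass 2: ignore reservations
> 		if victimsOn(n) {
> 			return n
> 		}
> 	}
> 	return nil
> }
> {code}
> Whether pass 2 should merely ignore the reservation or actually delete it is 
> exactly the open question above.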
> We'd love feedback on the following:
>  * Is passing a test like the one described above even a goal of YuniKorn 
> preemption?
>  * If so, how can we be more strategic about releasing reservations that 
> become major blockers, especially in the preemption context?
>  * Is there a simple way to opt out of the reservation feature altogether? 
> We never want a reservation to block a node. If a pod can't schedule in the 
> current cycle, we'd like it to wait without a reservation (in our case a 
> full node will always free up all at once at some point). Or is there 
> something we're misunderstanding that makes reservations necessary?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
