[ https://issues.apache.org/jira/browse/YUNIKORN-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikandan R reassigned YUNIKORN-3092:
--------------------------------------

    Assignee: Manikandan R

> Reservations can permanently block nodes, leading to preemption failure and a
> stuck scheduler state
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3092
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.6.3
>            Reporter: John Daciuk
>            Assignee: Manikandan R
>            Priority: Minor
>              Labels: preemption
>         Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h2. Context
> Since deploying YuniKorn in October 2024, we've encountered occasional
> preemption misses. We find a high-priority pod pending for hours, manually
> delete a low-priority pod, then see the high-priority pod schedule.
> We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2
> helpful. In particular, [this
> PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% expected
> preemption in our testing due to its reservation removal logic. However, we
> still find that 1.6.2 is not reliable with respect to preemption in practice,
> and we can reproduce preemption misses by scaling up our original preemption
> load test by 4x.
> h2. Repro
> With YuniKorn 1.6.3, schedule ~400 low-priority pods that live forever and
> fill up all node capacity. Once they are running, schedule the same number of
> high-priority pods to a different queue. Use the same resources for all the
> pods.
> We expect that all the high-priority pods will eventually schedule. However,
> we find about 10% of them stuck pending. This can be seen in the attached
> screenshot, where the high-priority pods are tier0.
> If we add logging as in [this diff from
> branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6],
> we see
> {quote}{{{}2025-06-23T04:54:26.776Z INFO core.scheduler.preemption
> objects/preemption.go:93 Removing node ip-100-76-60-239.ec2.internal from
> consideration. node.IsReserved: true, node reservations:
> map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal ->
> tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true
> {"applicationID": "tier0-1-406-328120", "allocationKey":
> "e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"{}}}}
> {quote}
> The node ip-100-76-60-239.ec2.internal, which has plenty of potential
> victims, is removed from consideration for preemption because it is reserved.
> Looking at the reservation map above, we see that pod tier0-1-395 holds the
> reservation and is therefore blocking the entire node. Why can't it schedule
> and release the reservation?
> {quote}{{{}2025-06-23T04:43:45.942Z INFO core.scheduler.application
> objects/application.go:1008 tryAllocate did not find a candidate
> allocation in the node iterator, allowPreemption: true,
> preemptAttemptsRemaining: 0 {"applicationID": "tier0-1-395-157140",
> "author": "MLP"{}}}}
> {quote}
> Because there are no more preemption attempts allowed for that queue in this
> cycle. Unfortunately, this situation repeats every cycle, since pod
> tier0-1-395 is never among the first in the queue to tryAllocate.
> h2. Thoughts
> This is one example, but there are a number of ways we can get stuck in such
> a cycle. For the preemption failure here, it seems we need some way to either
> remove the dead reservation or ignore it while considering preemption
> victims.
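
To make the cycle concrete, here is a toy model of it. This is a sketch only and not YuniKorn code: the names `run_cycle`, `simulate` and `attempts_per_cycle` are hypothetical, the per-queue budget stands in for what the log reports as preemptAttemptsRemaining, and the model assumes that a node reserved by one ask is unusable for every other ask but usable by the holder.

```python
def run_cycle(pending, reservations, attempts_per_cycle):
    """Run one scheduling cycle and return the asks that were placed.

    pending: asks in the (fixed) order the queue tries them.
    reservations: node name -> ask holding the reservation, or None.
    attempts_per_cycle: hypothetical per-queue preemption budget.
    """
    attempts_left = attempts_per_cycle
    placed = []
    for ask in pending:
        if attempts_left == 0:
            break  # asks further down the queue get no preemption attempt
        attempts_left -= 1
        # Assumption: nodes reserved by a different ask are filtered out.
        usable = [node for node, holder in reservations.items()
                  if holder is None or holder == ask]
        if usable:
            del reservations[usable[0]]  # the node is consumed by this ask
            placed.append(ask)
    return placed


def simulate(pending, reservations, attempts_per_cycle, cycles):
    """Repeat run_cycle and return the asks that are still pending."""
    pending = list(pending)
    reservations = dict(reservations)
    for _ in range(cycles):
        placed = run_cycle(pending, reservations, attempts_per_cycle)
        pending = [ask for ask in pending if ask not in placed]
    return pending


# ask-3 holds the only node but sits third in a queue with a budget of two.
stuck = simulate(["ask-1", "ask-2", "ask-3"], {"node-a": "ask-3"}, 2, 100)
print(stuck)
```

In this model the two asks ahead of the reservation holder spend the whole budget on a node they are not allowed to use, so nothing changes from one cycle to the next. Raising the budget to three, or moving the holder to the front of the queue, lets it schedule in the first cycle.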
> So for example, when we iterate through the nodes in [this
> code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
> perhaps we could first try filtering out reserved nodes (as the code does
> now), then try a second loop that ignores and/or breaks reservations if we
> find victims. Would ignoring the reservation be enough, or do we have to
> delete it for the preemption to then result in scheduling?
> We'd love to get feedback on the following:
> * Is passing a test like the one described above even a goal of YuniKorn
> preemption?
> * If so, how can we be more strategic about releasing reservations that
> become major blockers, especially in the preemption context?
> * We don't suppose there's a simple way to opt out of the reservation
> feature altogether, is there? We don't ever want a reservation to block a
> node. If the pod can't schedule in the current cycle, we'd like it to wait
> without a reservation (in our case a full node will always free up at some
> point all at once). Or is there something we're misunderstanding that makes
> us need reservations?
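
The two-pass idea from the Thoughts section could look roughly like the sketch below. It is an illustration under stated assumptions, not the logic in preemption.go: `Node`, `pick_node` and `needed_victims` are hypothetical names, and the first pass is assumed to skip only nodes reserved by a different ask.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    victims: int                       # preemptable low-priority pods on the node
    reserved_by: Optional[str] = None  # ask holding the reservation, if any


def pick_node(ask: str, nodes: List[Node],
              needed_victims: int = 1) -> Tuple[Optional[Node], bool]:
    """Return (node, overrides_reservation), or (None, False) if nothing fits."""
    # Pass 1: skip nodes reserved by another ask (assumed current behaviour).
    for node in nodes:
        if node.reserved_by in (None, ask) and node.victims >= needed_victims:
            return node, False
    # Pass 2: fall back to nodes reserved by other asks if they have victims.
    for node in nodes:
        if node.victims >= needed_victims:
            return node, True
    return None, False


nodes = [Node("n1", victims=0), Node("n2", victims=5, reserved_by="ask-x")]
node, overrides = pick_node("ask-y", nodes)
print(node.name, overrides)
```

The boolean is returned rather than acted on because the issue leaves open whether ignoring the reservation is enough or whether it has to be deleted; the caller would decide what to do with a node picked in the second pass.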