[
https://issues.apache.org/jira/browse/YUNIKORN-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032406#comment-18032406
]
Manikandan R commented on YUNIKORN-3137:
----------------------------------------
[~sudiptob2]
{quote}Fails to preempt more than 2 victims for a larger ask
{quote}
Can you share the unit test used to reproduce this? We do have similar
checks just below the victims traversal block you highlighted to ensure we
don't proceed with killing the victims in case of any shortfall. Are the
victims running on different nodes? If yes, and a victim's node doesn't have
enough space to accommodate the ask, then that victim would not be considered.
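To make the comparison concrete, below is a toy sketch of the guard the report points at. This is not the actual {{resources}} package; the types and the "only existing" semantics are simplified assumptions based on the report's description (larger must strictly exceed smaller on every type present in smaller):

```go
package main

import "fmt"

// Resource is a toy stand-in for yunikorn-core's resources.Resource
// (a map of resource type to quantity); not the real implementation.
type Resource map[string]int64

// strictlyGreaterThanOnlyExisting approximates the semantics the report
// describes: larger must strictly exceed smaller on every type that is
// present in smaller.
func strictlyGreaterThanOnlyExisting(larger, smaller Resource) bool {
	for typ, qty := range smaller {
		if larger[typ] <= qty {
			return false
		}
	}
	return true
}

func main() {
	ask := Resource{"vcore": 300, "memory": 300, "pods": 1}
	// Running total of the two victims already selected, per the report.
	victimsTotal := Resource{"vcore": 200, "memory": 200, "pods": 2}

	// The guard stops victim selection: the ask is not strictly greater
	// than the running total because of the "pods" dimension (1 <= 2) ...
	fmt.Println("keep selecting victims:", strictlyGreaterThanOnlyExisting(ask, victimsTotal))
	// ... even though the vcore/memory freed so far still fall short of the ask.
	fmt.Println("vcore shortfall:", ask["vcore"]-victimsTotal["vcore"])
}
```

Under these assumed semantics the pod-count dimension, not vcore or memory, is what trips the guard early.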
> Fails to preempt more than 2 victims for a larger ask.
> ------------------------------------------------------
>
> Key: YUNIKORN-3137
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3137
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.6.3
> Environment: Kind
> Reporter: Sudipto Baral
> Priority: Major
> Attachments: job-child1.yaml, job-child2.yaml, queues.yaml
>
>
> h3. Problem Description
> If a large pod ({{{}ask{}}}) requires evicting multiple smaller pods to fit,
> the scheduler can only preempt up to two pods, preventing the {{ask}} from
> being scheduled even when the total ask is under the guaranteed limit.
> Reference code:
> [preemption.go#L629-L642|https://github.com/apache/yunikorn-core/blob/7511f30539c781b30568047df20a8127b0278260/pkg/scheduler/objects/preemption.go#L629-L642]
> For example, if the ask is \{vcore: 300, memory: 300, pod: 1}, and each
> victim of size \{vcore: 100, memory: 100, pod: 1}, after two iterations,
> victimsTotalResource becomes \{vcore: 200, memory: 200, pod: 2}. At this
> point, no additional victims are added to the finalVictims list due to the
> following condition:
> {code:java}
> if p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting(victimsTotalResource) {code}
> As a result, only two pods are evicted, needlessly: the freed resources are
> still insufficient for the ask, leaving the large pod unscheduled.
> h3. Reproduce
> Please take a look at the attachments for the job and queue configurations.
> h4. Phase 1: Initial Allocation
> # {*}job-child1 → child1{*}: Requests 10 pods × 100m CPU, 100Mi Memory each
> ** {*}Gets{*}: 6 pods × 100m CPU, 100Mi Memory = 600m CPU, 600Mi Memory
> (cluster max)
> ** {*}Remaining{*}: 4 pods pending (400m CPU, 400Mi Memory needed)
> # {*}job-child2 → child2{*}: Requests 10 pods × 300m CPU, 300Mi Memory each
> ** {*}Gets{*}: 0 pods initially (no resources available)
> ** {*}Needs{*}: 300m CPU, 300Mi Memory to meet guarantee
> h4. Phase 2: Preemption Attempt for Guarantee
> # {*}Preemption for child2 guarantee{*}: Try to free 300m CPU, 300Mi Memory
> ** {*}Victims{*}: should preempt 3 pods from child1 (3 × 100m CPU, 100Mi
> Memory = 300m CPU, 300Mi Memory)
> ** {color:#de350b}Only 2 pods are actually preempted due to the condition in
> preemption.go{color}
> ** {*}Freed resources{*}: 200m CPU, 200Mi Memory (insufficient for child2
> guarantee)
> ** {*}Result{*}: child2 gets 0 pods, guarantee not met
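The shortfall in the walkthrough above can be checked with simple arithmetic. A minimal sketch, using only the quantities from the reproduction (CPU in millicores; the helper names are illustrative, not from the codebase):

```go
package main

import "fmt"

// Quantities taken from the reproduction above, CPU in millicores.
const (
	victimSize      = 100 // each child1 pod's CPU
	victimsEvicted  = 2   // what the scheduler actually preempts
	child2Guarantee = 300 // what child2 needs freed to meet its guarantee
)

// freedShortfall returns how far the freed CPU falls short of the guarantee
// after evicting the given number of victims.
func freedShortfall(evicted int) int {
	return child2Guarantee - evicted*victimSize
}

func main() {
	fmt.Println("freed:", victimsEvicted*victimSize)          // 200m
	fmt.Println("shortfall:", freedShortfall(victimsEvicted)) // 100m short
	// Victims needed to fully cover the guarantee (ceiling division):
	fmt.Println("victims needed:", (child2Guarantee+victimSize-1)/victimSize) // 3
}
```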
--
This message was sent by Atlassian Jira
(v8.20.10#820010)