[
https://issues.apache.org/jira/browse/YUNIKORN-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032469#comment-18032469
]
Sudipto Baral commented on YUNIKORN-3137:
-----------------------------------------
[~mani]
Thanks for reviewing this. I don’t have a unit test at the moment, but I tested
it on a *single-node* kind cluster using the queue configuration and job
configuration files shared above.
{quote}We have similar checks right below the victims traversal block you
highlighted, to ensure we don’t proceed with the preemption unnecessarily.
{quote}
Specifically, at line 641:
{code}
victimsTotalResource.AddTo(victim.GetAllocatedResource())
{code}
This call sits outside the finalVictims conditional block at line 637, so the
victim’s allocated resource is added to victimsTotalResource even when the
victim itself is never appended to finalVictims. As a result, the shortfall
check becomes ineffective.
Reference:
https://github.com/apache/yunikorn-core/blob/7511f30539c781b30568047df20a8127b0278260/pkg/scheduler/objects/preemption.go#L637-L641
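To make the failure mode concrete, here is a minimal, self-contained Go sketch of the loop shape described above. The Resource type, the "pods" resource name, and the comparison semantics are simplified stand-ins for yunikorn's resources package, not the real implementation; the assumption is that the ask must stay greater-than-or-equal on every resource type it defines (strictly greater in at least one), including the pod count, which is what stops collection after two victims:

```go
package main

import "fmt"

// Resource is a simplified stand-in for yunikorn's resources.Resource.
type Resource struct {
	Res map[string]int64
}

// AddTo accumulates other's quantities into r (missing types start at zero).
func (r *Resource) AddTo(other *Resource) {
	for k, v := range other.Res {
		r.Res[k] += v
	}
}

// StrictlyGreaterThanOnlyExisting is an assumed, simplified model of the
// check at preemption.go line 637: for every type existing in r, r must be
// >= other, with at least one type strictly greater.
func (r *Resource) StrictlyGreaterThanOnlyExisting(other *Resource) bool {
	strict := false
	for k, v := range r.Res {
		if v < other.Res[k] {
			return false
		}
		if v > other.Res[k] {
			strict = true
		}
	}
	return strict
}

// collectVictims mirrors the loop shape at preemption.go L629-L642: the
// append is gated by the check (line 637), but AddTo (line 641) runs
// unconditionally, so skipped victims still inflate the running total.
func collectVictims(ask *Resource, victims []*Resource) ([]*Resource, *Resource) {
	total := &Resource{Res: map[string]int64{}}
	var finalVictims []*Resource
	for _, v := range victims {
		if ask.StrictlyGreaterThanOnlyExisting(total) {
			finalVictims = append(finalVictims, v)
		}
		total.AddTo(v) // outside the conditional block above
	}
	return finalVictims, total
}

func main() {
	// One large ask, five identical small candidate victims.
	ask := &Resource{Res: map[string]int64{"vcore": 300, "memory": 300, "pods": 1}}
	var victims []*Resource
	for i := 0; i < 5; i++ {
		victims = append(victims, &Resource{Res: map[string]int64{"vcore": 100, "memory": 100, "pods": 1}})
	}

	finalVictims, total := collectVictims(ask, victims)
	// After two victims total = {vcore:200, memory:200, pods:2}; the ask's
	// pods:1 is no longer >= 2, so collection stops even though a
	// 100 vcore / 100 memory shortfall remains.
	fmt.Println("final victims:", len(finalVictims))
	// Meanwhile the total kept absorbing every candidate, so a later check
	// against victimsTotalResource would wrongly conclude enough was freed.
	fmt.Println("accumulated total vcore:", total.Res["vcore"])
}
```

Under these assumptions the loop returns only two victims while victimsTotalResource grows to cover all five candidates, which matches both the two-victim behavior reported in the issue and the ineffective shortfall check described above.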
> Fails to preempt more than 2 victims for a larger ask.
> ------------------------------------------------------
>
> Key: YUNIKORN-3137
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3137
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.6.3
> Environment: Kind
> Reporter: Sudipto Baral
> Priority: Major
> Attachments: job-child1.yaml, job-child2.yaml, queues.yaml
>
>
> h3. Problem Description
> If a large pod ({{ask}}) requires evicting multiple smaller pods to fit,
> the scheduler preempts at most two pods, preventing the {{ask}} from
> being scheduled even when the total ask is within the guaranteed limit.
> Reference code:
> [preemption.go#L629-L642|https://github.com/apache/yunikorn-core/blob/7511f30539c781b30568047df20a8127b0278260/pkg/scheduler/objects/preemption.go#L629-L642]
> For example, if the ask is \{vcore: 300, memory: 300, pod: 1} and each
> victim has size \{vcore: 100, memory: 100, pod: 1}, then after two
> iterations victimsTotalResource becomes \{vcore: 200, memory: 200, pod: 2}. At this
> point, no additional victims are added to the finalVictims list due to the
> following condition:
> {code:java}
> if p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting(victimsTotalResource)
> {code}
> As a result, two pods are evicted to no purpose: the freed resources are
> still insufficient for the ask, and the large pod remains unscheduled.
> h3. Reproduce
> Please see the attached job and queue configuration files.
> h4. Phase 1: Initial Allocation
> # {*}job-child1 → child1{*}: Request 10 pods × 100m CPU, 100Mi Memory each
> ** {*}Gets{*}: 6 pods × 100m CPU, 100Mi Memory = 600m CPU, 600Mi Memory (cluster max)
> ** {*}Remaining{*}: 4 pods pending (400m CPU, 400Mi Memory needed)
> # {*}job-child2 → child2{*}: Request 10 pods × 300m CPU, 300Mi Memory each
> ** {*}Gets{*}: 0 pods initially (no resources available)
> ** {*}Needs{*}: 300m CPU, 300Mi Memory to meet guarantee
> h4. Phase 2: Preemption Attempt for Guarantee
> # {*}Preemption for child2 guarantee{*}: Try to free 300m CPU, 300Mi Memory
> ** {*}Victims{*}: should preempt 3 pods from child1 (3 × 100m CPU, 100Mi
> Memory = 300m CPU, 300Mi Memory)
> ** {color:#de350b}Only 2 pods are actually preempted due to the condition in
> preemption.go{color}
> ** {*}Freed resources{*}: 200m CPU, 200Mi Memory (insufficient for child2
> guarantee)
> ** {*}Result{*}: child2 gets 0 pods, guarantee not met
--
This message was sent by Atlassian Jira
(v8.20.10#820010)