[
https://issues.apache.org/jira/browse/YUNIKORN-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767200#comment-17767200
]
Rainie Li edited comment on YUNIKORN-1988 at 9/20/23 5:04 PM:
--------------------------------------------------------------
[~ccondit] Please take a look: https://github.com/apache/yunikorn-core/pull/660
I will update the PR to include related tests.
was (Author: rainieli):
[~ccondit] Please take a look: https://github.com/apache/yunikorn-core/pull/660
> Preemption happens when a queue lower than its guaranteed capacity
> -------------------------------------------------------------------
>
> Key: YUNIKORN-1988
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1988
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Rainie Li
> Assignee: Rainie Li
> Priority: Critical
> Labels: pull-request-available
>
> *Background:*
> We set tier-based priorityClass values and run YuniKorn 1.3 with the admission
> controller in production (our prod cluster has hundreds of EKS nodes).
> Many production tier2 jobs were preempted unexpectedly. From the application logs,
> we saw that the driver pods were all shut down at around the same time.
> Most of the failed jobs were from the same queue. We set 300G as the guaranteed
> memory for the queue that was preempted, and all driver pods required 24G of memory.
> For now, we have disabled the preemption feature in production to mitigate the issue.
> *Investigation:*
> We reproduced the issue in a dev environment: preemption can happen when a queue is
> below its guaranteed capacity.
> The YuniKorn k8shim log confirmed that our driver pods were set as originator.
> I am investigating how to fix the issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)