[
https://issues.apache.org/jira/browse/YUNIKORN-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767328#comment-17767328
]
Rainie Li commented on YUNIKORN-1988:
-------------------------------------
I will open a new PR since I found a few potential issues.
> Preemption happens when a queue lower than its guaranteed capacity
> -------------------------------------------------------------------
>
> Key: YUNIKORN-1988
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1988
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Rainie Li
> Assignee: Rainie Li
> Priority: Critical
> Labels: pull-request-available
>
> *Background:*
> We set tier-based priorityClass values and run YuniKorn 1.3 with the admission
> controller in production (our prod cluster has hundreds of EKS nodes).
> Many production tier2 jobs were preempted unexpectedly. From the application logs,
> we saw that the driver pods were all shut down at around the same time.
> Most of the failed jobs were from the same queue. We set 300G as the guaranteed
> memory for the queue that was preempted, and all driver pods required 24G of memory.
> For now, we have disabled the preemption feature in production to mitigate the issue.
> *Investigation:*
> Reproduced the issue in a dev environment: preemption can happen when a queue is
> below its guaranteed capacity.
> Confirmed in the YuniKorn k8shim log: our driver pods were set as originator.
> I am investigating how to fix the issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)