[ 
https://issues.apache.org/jira/browse/YUNIKORN-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767200#comment-17767200
 ] 

Rainie Li commented on YUNIKORN-1988:
-------------------------------------

[~ccondit] Please take a look: https://github.com/apache/yunikorn-core/pull/660

> Preemption happens when a queue lower than its guaranteed capacity 
> -------------------------------------------------------------------
>
>                 Key: YUNIKORN-1988
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1988
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Rainie Li
>            Assignee: Rainie Li
>            Priority: Critical
>              Labels: pull-request-available
>
> *Background:*
> We set tier-based priorityClass values and run YuniKorn 1.3 with the admission
> controller in production (our prod cluster has hundreds of EKS nodes).
> Many production tier2 jobs were preempted unexpectedly. From the application logs,
> we saw that the driver pods were all shut down at around the same time.
> Most of the failed jobs were from the same queue. We set 300G as the guaranteed
> memory for the queue that was preempted, and all driver pods required 24G of memory.
> For now, we have disabled the preemption feature in production to mitigate the issue.
> *Investigation:*
> Reproduced the issue in a dev environment: preemption can happen when a queue is
> below its guaranteed capacity.
> Confirmed from the YuniKorn k8shim log that our driver pods were set as originator.
> I am investigating how to fix the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

