[ https://issues.apache.org/jira/browse/YUNIKORN-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870632#comment-17870632 ]

Peter Bacsko edited comment on YUNIKORN-2784 at 8/2/24 9:31 PM:
----------------------------------------------------------------

Thanks [~dimm]. This is definitely not a deadlock but a scheduling/preemption 
issue. The scheduler constantly wants to schedule a daemon set with preemption, 
but it can't.

I can see the following: it does seem to be able to schedule pods from various 
applications, e.g. "osg-opportunistic-wn", "yunikorn-gilpin-lab-autogen", 
"osg-icecube-wn", "yunikorn-dl4nlpspace-autogen". However, when it gets to 
"yunikorn-seaweedfs-csi-autogen", it cannot free up resources for the request 
"7fcc6c57-8296-43d2-9543-311657814d99", which is pod "csi-seaweedfs-node-5vbpx". 
Why this affects the pods from the application "yunikorn-ipmi-autogen" is not 
clear at the moment.

But at least we have data to start with.
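
As a starting point for digging through that data, here is a minimal Go sketch 
that walks the attached full state dump and prints the JSON path of every 
string value mentioning the stuck request ID. It assumes the dump has been 
saved locally as fullstatedump.json (a placeholder name) and does not depend on 
the exact dump schema.

```go
// findkey.go: walk a saved /ws/v1/fullstatedump and print the paths of every
// string value that mentions a given needle (here, the stuck request ID).
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// walk recursively descends into decoded JSON and reports paths whose
// string values contain the needle.
func walk(path string, v interface{}, needle string) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			walk(path+"."+k, child, needle)
		}
	case []interface{}:
		for i, child := range t {
			walk(fmt.Sprintf("%s[%d]", path, i), child, needle)
		}
	case string:
		if strings.Contains(t, needle) {
			fmt.Println(path, "=", t)
		}
	}
}

func main() {
	// File name is an assumption; point it at the attached dump.
	data, err := os.ReadFile("fullstatedump.json")
	if err != nil {
		panic(err)
	}
	var root interface{}
	if err := json.Unmarshal(data, &root); err != nil {
		panic(err)
	}
	// Request ID of the pending csi-seaweedfs-node-5vbpx ask mentioned above.
	walk("$", root, "7fcc6c57-8296-43d2-9543-311657814d99")
}
```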


> Scheduler stuck
> ---------------
>
>                 Key: YUNIKORN-2784
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2784
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Dmitry
>            Priority: Major
>         Attachments: Screenshot 2024-08-02 at 1.16.30 PM.png, Screenshot 
> 2024-08-02 at 1.20.23 PM.png, dumps.tgz, logs
>
>
> Shortly after switching to YuniKorn, a bunch of tiny pods get stuck pending 
> (screenshot 1). Other pods get stuck as well, but these are the most visible 
> and should always be running.
> After restarting the scheduler, everything gets scheduled immediately 
> (screenshot 2).
> Attaching the output of `/ws/v1/stack`, `/ws/v1/fullstatedump` and 
> `/debug/pprof/goroutine?debug=2`, as well as the scheduler logs.
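
For reference, the three diagnostics listed above can be re-collected with a 
short Go sketch like the one below. It assumes the scheduler's web endpoint is 
reachable at localhost:9080 (for example via kubectl port-forward) and that the 
pprof handler is served from the same port; both are assumptions, so adjust the 
address to match the actual deployment.

```go
// collect.go: fetch the YuniKorn diagnostics listed in the report and save
// each response to a local file for later analysis.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	base := "http://localhost:9080" // assumed port-forwarded scheduler web endpoint
	endpoints := map[string]string{
		"stack.txt":          "/ws/v1/stack",
		"fullstatedump.json": "/ws/v1/fullstatedump",
		"goroutines.txt":     "/debug/pprof/goroutine?debug=2",
	}
	for file, path := range endpoints {
		resp, err := http.Get(base + path)
		if err != nil {
			fmt.Fprintln(os.Stderr, path, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err := os.WriteFile(file, body, 0o644); err != nil {
			fmt.Fprintln(os.Stderr, file, err)
			continue
		}
		fmt.Printf("saved %s (%d bytes) from %s\n", file, len(body), path)
	}
}
```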


