[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Condit resolved YUNIKORN-2521.
------------------------------------
Fix Version/s: 1.6.0, 1.5.1
Target Version: 1.6.0, 1.5.1
Resolution: Fixed
This was delivered as part of YUNIKORN-2544.
> Scheduler deadlock
> ------------------
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
> Issue Type: Bug
> Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
> Reporter: Noah Yoshida
> Assignee: Craig Condit
> Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: 0001-YUNIKORN-2539-core.patch,
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt,
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt,
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png,
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out,
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out,
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out,
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt,
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt,
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt,
> running-logs.txt
>
>
> Discussion on the YuniKorn Slack:
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, YuniKorn will deadlock and prevent any new pods from starting.
> All pods stay in Pending, and there are no error logs in the YuniKorn
> scheduler indicating any issue.
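> This symptom is consistent with a lock-ordering deadlock: two goroutines each
> hold one lock while waiting for the other, so both park forever and nothing is
> ever logged. A minimal Go sketch of that failure mode (purely illustrative;
> nodeLock/appLock are hypothetical names, not YuniKorn's actual locks):
> {code:go}
> package main
>
> import (
> 	"fmt"
> 	"sync"
> 	"time"
> )
>
> func main() {
> 	var nodeLock, appLock sync.Mutex
>
> 	// Goroutine 1: takes the node lock, then the app lock.
> 	go func() {
> 		nodeLock.Lock()
> 		defer nodeLock.Unlock()
> 		time.Sleep(10 * time.Millisecond) // widen the race window
> 		appLock.Lock()                    // parks forever once goroutine 2 holds appLock
> 		defer appLock.Unlock()
> 	}()
>
> 	// Goroutine 2: takes the same locks in the opposite order.
> 	go func() {
> 		appLock.Lock()
> 		defer appLock.Unlock()
> 		time.Sleep(10 * time.Millisecond)
> 		nodeLock.Lock() // parks forever once goroutine 1 holds nodeLock
> 		defer nodeLock.Unlock()
> 	}()
>
> 	time.Sleep(time.Second)
> 	fmt.Println("both goroutines are parked in sync.Mutex.Lock; no error is logged")
> }
> {code}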
> Additionally, the pods all have the correct annotations and labels from the
> admission service, so they are at least being admitted into Kubernetes
> correctly.
> The issue was seen intermittently on YuniKorn 1.5 in EKS, Kubernetes version
> `v1.28.6-eks-508b6b3`.
> In our environment, we run about 25-50 nodes and 200-400 pods. Pods and nodes
> are added and removed fairly frequently, as we run ML workloads.
> Attached is the goroutine dump. We were not able to get a state dump, as the
> endpoint kept timing out.
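> (A full goroutine dump can be pulled from any Go process that exposes
> net/http/pprof; a generic sketch of wiring that up, where the localhost:6060
> address is an assumed example, not YuniKorn's actual debug port:)
> {code:go}
> package main
>
> import (
> 	"log"
> 	"net/http"
> 	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
> )
>
> func main() {
> 	// While this server runs, the dump format attached to this issue is
> 	// available via: curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
> 	log.Fatal(http.ListenAndServe("localhost:6060", nil))
> }
> {code}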
> The workaround is to restart the YuniKorn scheduler pod. Sometimes you also
> have to delete any Pending pods that got stuck while the scheduler was
> deadlocked, so that they get picked up by the new scheduler pod. A hedged
> client-go sketch of that cleanup step follows.
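> (In this sketch the "default" namespace is an assumption; adjust it to the
> affected namespace:)
> {code:go}
> package main
>
> import (
> 	"context"
> 	"log"
>
> 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> 	"k8s.io/client-go/kubernetes"
> 	"k8s.io/client-go/tools/clientcmd"
> )
>
> func main() {
> 	// Build a client from the local kubeconfig.
> 	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
> 	if err != nil {
> 		log.Fatal(err)
> 	}
> 	client, err := kubernetes.NewForConfig(config)
> 	if err != nil {
> 		log.Fatal(err)
> 	}
>
> 	// List pods stuck in Pending and delete them so the restarted
> 	// scheduler re-schedules them.
> 	pods, err := client.CoreV1().Pods("default").List(context.TODO(),
> 		metav1.ListOptions{FieldSelector: "status.phase=Pending"})
> 	if err != nil {
> 		log.Fatal(err)
> 	}
> 	for _, p := range pods.Items {
> 		if err := client.CoreV1().Pods("default").Delete(context.TODO(), p.Name,
> 			metav1.DeleteOptions{}); err != nil {
> 			log.Printf("failed to delete %s: %v", p.Name, err)
> 		}
> 	}
> }
> {code}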