[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838732#comment-17838732 ]
Craig Condit commented on YUNIKORN-2521:
----------------------------------------
[~jshmchenxi], [~pbacsko] I think this latest log (2024-04-18) might be a false
positive. TryAllocate() is in all the mentioned code paths (and takes an app
lock), and preemption requires walking other apps which takes their locks as
well. So the detector can see App A locked, then App B, and in another run, App
B first, then App A. However, there's no way for TryAllocate() to be active in
multiple goroutines (it's only ever called from the main scheduler loop). So I
think this isn't actually an issue, just something that trips up the detector.
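To illustrate why an order-based detector flags this pattern even though it
cannot deadlock, here is a minimal, self-contained Go sketch (not YuniKorn
code; the type and function names are placeholders): a single goroutine takes
two locks in opposite orders on successive calls, which is the A-then-B /
B-then-A pattern above, yet with only one goroutine there is no possible
circular wait.

```go
package main

import (
	"fmt"
	"sync"
)

// app is a placeholder (not a YuniKorn type) standing in for an
// application object guarded by its own lock.
type app struct {
	name string
	sync.Mutex
}

// tryAllocate mimics the pattern described above: lock one app, then lock
// another app (as preemption does when walking other apps) while still
// holding the first lock. The acquisition order depends on the arguments.
func tryAllocate(first, second *app) {
	first.Lock()
	defer first.Unlock()
	second.Lock()
	defer second.Unlock()
	fmt.Printf("allocated while holding %s then %s\n", first.name, second.name)
}

func main() {
	appA := &app{name: "app-A"}
	appB := &app{name: "app-B"}

	// Both lock orders occur, but only ever from this single goroutine
	// (analogous to the main scheduler loop), so the reversal can never
	// produce a circular wait. A lock-order based detector would still
	// record A-before-B and later B-before-A and report a potential
	// deadlock.
	tryAllocate(appA, appB)
	tryAllocate(appB, appA)
}
```

A detector that tracks acquisition order reports the reversal regardless of
whether a second goroutine ever contends for the same locks, which matches the
false positive described above.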
> Scheduler deadlock
> ------------------
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
> Issue Type: Bug
> Affects Versions: 1.5.0
> Environment: YuniKorn 1.5
> AWS EKS: v1.28.6-eks-508b6b3
> Reporter: Noah Yoshida
> Assignee: Craig Condit
> Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch,
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt,
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt,
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png,
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out,
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out,
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out,
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt,
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt,
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt,
> running-logs.txt
>
>
> Discussion on Yunikorn slack:
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, YuniKorn will deadlock and prevent any new pods from starting.
> All affected pods stay in Pending, and there are no error logs in the
> YuniKorn scheduler indicating any issue.
> Additionally, the pods all have the correct annotations and labels from the
> admission service, so they are at least being created in Kubernetes
> correctly.
> The issue was seen intermittently on YuniKorn 1.5 in EKS, on EKS version
> `v1.28.6-eks-508b6b3`.
> In our environment we run about 25-50 nodes and 200-400 pods; pods and nodes
> are added and removed fairly frequently since we run ML workloads.
> Attached is the goroutine dump. We were not able to get a state dump as the
> endpoint kept timing out.
> The workaround is to restart the YuniKorn scheduler pod. Sometimes you also
> have to delete any Pending pods that got stuck while the scheduler was
> deadlocked, so they get picked up by the new scheduler pod.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]