[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834392#comment-17834392
 ] 

Craig Condit commented on YUNIKORN-2521:
----------------------------------------

For those affected, would anyone be willing to build a custom version with the 
following patches applied against 1.5.0 and attempt to reproduce the deadlock 
locally?

Apply to yunikorn-core v1.5.0:
 - [^0001-YUNIKORN-2539-core.patch]

Apply to yunikorn-k8shim v1.5.0:
 - [^0002-YUNIKORN-2539-k8shim.patch]

Once the patches are applied, you will need to update the Go dependencies in 
yunikorn-k8shim to point to your modified yunikorn-core. Assuming both 
repositories are checked out in sibling directories, running the following 
from the yunikorn-k8shim directory should work:
{code:bash}
# Use local yunikorn-core
$ go mod edit -replace github.com/apache/yunikorn-core=../yunikorn-core

# Fixup go.mod / go.sum references to transitive dependencies
$ go mod tidy
{code}
These patches replace sync.Mutex and sync.RWMutex with internal locking.Mutex 
and locking.RWMutex implementations. The new implementations wrap the 
go-deadlock library with logic to conditionally enable deadlock detection 
based on the following environment variables:

To enable the feature:
 - DEADLOCK_DETECTION_ENABLED=true

To customize the timeout before potential deadlocks are logged (default is 60 
seconds):
 - DEADLOCK_TIMEOUT_SECONDS=60
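
To illustrate how the two variables above could be interpreted, here is a 
minimal sketch. It is not taken from the patches: detectionConfig and 
defaultTimeout are hypothetical names, and the fallback behavior for unset or 
unparsable values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultTimeout mirrors the 60-second default mentioned above.
const defaultTimeout = 60 * time.Second

// detectionConfig is a hypothetical helper, not code from the patches.
// It reads both variables through getenv (os.Getenv in production, a stub
// in tests) and falls back to "disabled" and 60 seconds on bad input.
func detectionConfig(getenv func(string) string) (bool, time.Duration) {
	enabled, err := strconv.ParseBool(getenv("DEADLOCK_DETECTION_ENABLED"))
	if err != nil {
		enabled = false // unset or unparsable: leave detection off (assumption)
	}

	timeout := defaultTimeout
	secs, err := strconv.Atoi(getenv("DEADLOCK_TIMEOUT_SECONDS"))
	if err == nil && secs > 0 {
		timeout = time.Duration(secs) * time.Second
	}
	return enabled, timeout
}

func main() {
	enabled, timeout := detectionConfig(os.Getenv)
	fmt.Printf("deadlock detection enabled=%t timeout=%s\n", enabled, timeout)
}
```

Passing the lookup function as a parameter keeps the parsing testable 
without modifying the process environment.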

When deploying to Kubernetes, DEADLOCK_DETECTION_ENABLED=true must be set in 
the yunikorn-scheduler container:
{code:yaml}
...

spec:
  containers:
  - env:
    - name: DEADLOCK_DETECTION_ENABLED
      value: "true"
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
...
{code}
If the deployment succeeds, you should see a message logged immediately on 
startup stating that deadlock detection is enabled. A full state dump will 
also indicate this under the configuration section. If a potential deadlock is 
detected, a message will be logged with the stack traces of all involved 
goroutines. This should make it easier to narrow down the root cause of this 
issue.
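
For readers unfamiliar with timeout-based detection, the following sketch 
shows the general idea using only the standard library. It is not the 
go-deadlock implementation or the patched locking.Mutex: watchedMutex and 
simulateStall are hypothetical names, and the sketch covers only the "lock 
wait exceeded the timeout" case, dumping the stacks of all goroutines rather 
than just the ones involved.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// watchedMutex is a hypothetical stand-in used for illustration only.
type watchedMutex struct {
	mu      sync.Mutex
	timeout time.Duration
	report  func(stacks string) // called when Lock waits longer than timeout
}

// Lock starts a watchdog goroutine that reports all goroutine stacks if the
// lock has not been acquired within the configured timeout.
func (m *watchedMutex) Lock() {
	acquired := make(chan struct{})
	go func() {
		timer := time.NewTimer(m.timeout)
		defer timer.Stop()
		select {
		case <-acquired: // lock obtained in time, nothing to report
		case <-timer.C:
			buf := make([]byte, 1<<16)
			n := runtime.Stack(buf, true) // true: include all goroutines
			m.report(string(buf[:n]))
		}
	}()
	m.mu.Lock()
	close(acquired)
}

func (m *watchedMutex) Unlock() { m.mu.Unlock() }

// simulateStall holds the lock for holdMs milliseconds while a second Lock
// call waits, and returns the reported stack dump ("" if none was reported).
func simulateStall(timeoutMs, holdMs int) string {
	reports := make(chan string, 1)
	m := &watchedMutex{
		timeout: time.Duration(timeoutMs) * time.Millisecond,
		report:  func(stacks string) { reports <- stacks },
	}
	m.Lock()
	go func() {
		time.Sleep(time.Duration(holdMs) * time.Millisecond)
		m.Unlock()
	}()
	m.Lock() // blocks until the goroutine above releases the lock
	m.Unlock()
	select {
	case s := <-reports:
		return s
	default:
		return ""
	}
}

func main() {
	dump := simulateStall(50, 200)
	fmt.Printf("reported %d bytes of stack traces\n", len(dump))
}
```

A long wait is only a hint of a deadlock, which is why the timeout is 
configurable: a value that is too low would flag locks that are merely slow.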

> Scheduler deadlock
> ------------------
>
>                 Key: YUNIKORN-2521
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
>             Project: Apache YuniKorn
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>         Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>            Reporter: Noah Yoshida
>            Assignee: Craig Condit
>            Priority: Critical
>         Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, goroutine-4-3-1.out, goroutine-4-3-2.out, 
> goroutine-4-3-3.out, goroutine-4-3.out, goroutine-dump.txt, 
> goroutine-while-blocking-2.out, goroutine-while-blocking.out, profile012.gif, 
> profile013.gif
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)