[ 
https://issues.apache.org/jira/browse/YUNIKORN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton closed YUNIKORN-540.
---------------------------------
    Resolution: Not A Bug

We don't have any deadlock in production.

The deadlock was caused by a watch expression I had set on sa.GetAllocatedResource in the debugger.

> Possible deadlock when recovering or deleting an allocation ask
> ---------------------------------------------------------------
>
>                 Key: YUNIKORN-540
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-540
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Kinga Marton
>            Priority: Critical
>         Attachments: stacktrace.txt
>
>
> Steps to reproduce the deadlock locally during recovery:
>  # Modify the sleep example to use a longer sleep time (for example 300s), 
> so that the pods are still running after recovery.
>  # Once the pods are running, stop the scheduler.
>  # Start the scheduler in debug mode and add a breakpoint in the 
> application#RecoverAllocationAsk(ask *AllocationAsk) method: 
> [https://github.com/apache/incubator-yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L400.]
>  I think the breakpoint is needed to slow execution down a little. I tried 
> to reproduce the problem in a normal run by adding some sleep, but I 
> couldn't; it only appeared in debug mode. When I commented out the lock, 
> the hang disappeared.
>  # Once the program stops at the breakpoint, let it continue.
>  # After this step it hangs until the node recovery times out.



