[ https://issues.apache.org/jira/browse/YUNIKORN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kinga Marton updated YUNIKORN-540:
----------------------------------
Attachment: stacktrace.txt
> Possible deadlock when recovering or deleting an allocation ask
> ---------------------------------------------------------------
>
> Key: YUNIKORN-540
> URL: https://issues.apache.org/jira/browse/YUNIKORN-540
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Kinga Marton
> Priority: Critical
> Attachments: stacktrace.txt
>
>
> Steps to reproduce locally the deadlock during recovery:
> # Modify the sleep example to use a longer sleep time (for example 300s)
> so that the pods are still running after recovery.
> # When the pods are running, stop the scheduler.
> # Start the scheduler in debug mode and add a breakpoint in the
> application#RecoverAllocationAsk(ask *AllocationAsk) method:
> [https://github.com/apache/incubator-yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L400].
> I think the breakpoint is needed to make this path a little slower than usual.
> I tried to reproduce the hang in normal running mode by adding a sleep
> instead, but could not; it only showed up in debug mode. Commenting out the
> lock also made it disappear.
> # Once the program stops at the breakpoint, let it continue.
> # After this step the scheduler hangs until the node recovery times out.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)