[
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850240#comment-17850240
]
Wilfred Spiegelenburg commented on YUNIKORN-2646:
-------------------------------------------------
On the analysis of the stack trace: [~pbacsko] is correct. We have seen this
before and it is a false positive.
This lock order check points towards something like the following case:
* app A -> Allocate, trigger preemption -> check if app B can be a victim
* app B -> Allocate, trigger preemption -> check if app A can be a victim
Two points:
# the scheduling cycle is single threaded.
# the application triggering preemption is never a victim.
So how does that relate to the stack trace? The
{{PartitionContext.tryAllocate}} calls shown in the logs are never running at
the same time, and scheduling does not run multiple goroutines. The last point
is that by the time scheduling leaves {{Application.tryAllocate}} for the next
cycle, all locks that were held have been released. The next cycle could look
at the same application again or might use a completely different one.
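To illustrate why a lock order check can still fire in this situation, here is a minimal sketch in Go. The {{orderDetector}} type and the {{schedule}} helper are hypothetical and written only for this example; they are not YuniKorn code and not the detector used in the report. The sketch assumes a checker that simply remembers which lock was held while another was taken.

```go
package main

import "fmt"

// orderDetector is a toy lock-order checker (hypothetical, for illustration).
type orderDetector struct {
	held   []string                   // locks currently held, in acquisition order
	before map[string]map[string]bool // before[x][y]: x was held while y was taken
}

func newOrderDetector() *orderDetector {
	return &orderDetector{before: make(map[string]map[string]bool)}
}

// lock records the acquisition and reports true when the inverse order
// was recorded earlier, which is what gets logged as a potential deadlock.
func (d *orderDetector) lock(name string) bool {
	inversion := false
	for _, h := range d.held {
		if d.before[name][h] {
			inversion = true
		}
		if d.before[h] == nil {
			d.before[h] = make(map[string]bool)
		}
		d.before[h][name] = true
	}
	d.held = append(d.held, name)
	return inversion
}

// unlock releases the most recently taken lock with the given name.
func (d *orderDetector) unlock(name string) {
	for i := len(d.held) - 1; i >= 0; i-- {
		if d.held[i] == name {
			d.held = append(d.held[:i], d.held[i+1:]...)
			return
		}
	}
}

// schedule mimics one cycle: lock the asking app, inspect a victim, release both.
func schedule(d *orderDetector, ask, victim string) bool {
	d.lock(ask)
	flagged := d.lock(victim)
	d.unlock(victim)
	d.unlock(ask)
	return flagged
}

func main() {
	d := newOrderDetector()
	// Both cycles run one after the other on a single goroutine.
	fmt.Println("cycle 1 flagged:", schedule(d, "appA", "appB"))
	fmt.Println("cycle 2 flagged:", schedule(d, "appB", "appA"))
	fmt.Println("locks still held:", len(d.held))
}
```

The second cycle is flagged even though the two cycles never overlap and no lock is held between them, which is the shape of the false positive described above.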
When building the victim list via {{Queue.FindEligiblePreemptionVictims}}
and the recursive version of that call, the queue of the application that
triggered the preemption is filtered out. The lock held in
{{Application.tryAllocate}} is on an application that cannot later be selected
as a victim. If that did occur, scheduling would immediately stop at that
point, and we would never see a second instance of this stack trace in the
deadlock logging: the lock taken on the application for scheduling is a write
lock, so getting a read lock on the same application would block.
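The filtering step can be sketched roughly as follows. This is a simplified illustration in Go, not the actual {{Queue.FindEligiblePreemptionVictims}} code: the {{queue}} type, its fields and the {{eligibleVictims}} function are hypothetical, and the queue hierarchy is flattened into a plain list instead of being walked recursively.

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a scheduler queue.
type queue struct {
	name string
	apps []string
}

// eligibleVictims collects applications from every queue except the one
// that holds the application asking for preemption.
func eligibleVictims(queues []queue, askQueue string) []string {
	var victims []string
	for _, q := range queues {
		if q.name == askQueue {
			continue // the asking application's queue never contributes victims
		}
		victims = append(victims, q.apps...)
	}
	return victims
}

func main() {
	queues := []queue{
		{name: "root.a", apps: []string{"appA"}},
		{name: "root.b", apps: []string{"appB", "appC"}},
	}
	// appA triggers preemption from root.a, so only root.b is considered.
	fmt.Println(eligibleVictims(queues, "root.a"))
}
```

Under that assumption the application whose write lock is held during scheduling never appears in the victim list, so no second lock is requested on it.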
We need to investigate how we can exclude this from the potential deadlock
detection. The only option I can find at the moment is setting
{{Opts.DisableLockOrderDetection}} for the detection code if you want to run
this with preemption turned on.
> Deadlock detected during preemption
> -----------------------------------
>
> Key: YUNIKORN-2646
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Dmitry
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: yunikorn-logs-lock.txt.gz
>
>
> Hitting deadlocks in 1.5.1
> The log is attached
--
This message was sent by Atlassian Jira
(v8.20.10#820010)