[ 
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850240#comment-17850240
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2646:
-------------------------------------------------

Regarding the analysis of the stack trace: [~pbacsko] is correct. We have seen 
this before and it is a false positive.

This lock order check points towards something like the following case:
 * app A -> Allocate, trigger preemption -> check if app B can be a victim
 * app B -> Allocate, trigger preemption -> check if app A can be a victim

Two points:
 # The scheduling cycle is single threaded.
 # The application triggering preemption is never a victim.

So how does that relate to the stack trace? The 
{{PartitionContext.tryAllocate}} calls shown in the logs never run at the 
same time. Scheduling also does not run multiple goroutines. The last point is 
that when we leave {{Application.tryAllocate}} for the next cycle, all locks 
that were held have been released. The next cycle could look at the same 
application again or might use a completely different one.

When building the victim list via {{Queue.FindEligiblePreemptionVictims}} 
and the recursive version of that call, the queue of the application that 
triggered the preemption is filtered out. The lock held in 
{{Application.tryAllocate}} is therefore on an application that cannot later be 
selected as a victim. If that did occur, scheduling would immediately stop at 
that point: the lock taken on the application for scheduling is a write lock, 
and getting a read lock on the same application would block. We would never see 
a second instance of this stack trace in the deadlock logging.
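
The write lock versus read lock point can be shown with a small sketch using the standard library {{sync.RWMutex}}. The {{app}} type and {{canInspectAsVictim}} helper are hypothetical and are not YuniKorn code. {{TryRLock}} (Go 1.18 and later) is used here only so the example reports the situation instead of hanging; a plain {{RLock}} in the same place would block forever.

```go
package main

import (
	"fmt"
	"sync"
)

// app is a hypothetical stand-in for a scheduler application object.
type app struct {
	sync.RWMutex
	name string
}

// canInspectAsVictim tries to take the read lock without blocking and
// releases it again straight away when it succeeds.
func canInspectAsVictim(a *app) bool {
	if !a.TryRLock() {
		return false
	}
	a.RUnlock()
	return true
}

func main() {
	asking := &app{name: "appA"}
	other := &app{name: "appB"}

	// The application being scheduled holds its write lock for the whole call.
	asking.Lock()
	fmt.Println(asking.name, "readable while write locked:", canInspectAsVictim(asking))
	fmt.Println(other.name, "readable while appA is locked:", canInspectAsVictim(other))
	asking.Unlock()

	// Once the cycle is over the lock is free again.
	fmt.Println(asking.name, "readable after unlock:", canInspectAsVictim(asking))
}
```

If the asking application could end up in its own victim list, the read lock attempt would never succeed and scheduling would stall on the first occurrence.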

We need to investigate how we can exclude this from the potential deadlock 
detection. The only option I can find at the moment is setting 
{{Opts.DisableLockOrderDetection}} for the detection code if you want to run 
this with preemption turned on.

> Deadlock detected during preemption
> -----------------------------------
>
>                 Key: YUNIKORN-2646
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Dmitry
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: yunikorn-logs-lock.txt.gz
>
>
> Hitting deadlocks in 1.5.1
> The log is attached



