[ 
https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855287#comment-16855287
 ] 

Benjamin Mahler commented on MESOS-9808:
----------------------------------------

Thanks for looking into this [~asekretenko]!

This can happen when a dispatch has objects bound into it whose 
destructors do any of the following:
* terminate a process
* dispatch to a process using a UPID that didn't resolve to a Process upon 
construction (I highly doubt we have any code doing this)
* send a message to a local Process (i.e. in the same OS process) (I doubt this 
will be an issue outside of testing, since we use dispatch for local components)

The issue is that we currently destruct DispatchEvents that were dropped for 
TERMINATING Processes while still holding the ProcessReference to the 
TERMINATING Process (whoops!). The destructors of the bound objects can 
therefore make further calls that block on the processes_mutex (e.g. 
terminate()) while the cleanup of the TERMINATING Process, which holds that 
mutex, is spinning waiting for transient references to go away.
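
To illustrate the ordering problem, here is a minimal sketch in plain standard 
C++. It does not use libprocess and is not the patch under review: FakeProcess, 
FakeReference, FakeEvent, deliverUnsafe(), deliverSafe() and 
refsSeenAtDestruction() are all hypothetical names made up for this example. 
It only shows the general idea of moving a dropped event out of the scope that 
holds the reference, so that its destructor runs after the reference is gone.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for a process that counts transient references.
struct FakeProcess {
  std::atomic<int> refs{0};
  bool terminating = true;
};

// Hypothetical RAII reference, loosely analogous to a ProcessReference.
class FakeReference {
public:
  explicit FakeReference(FakeProcess* p) : process(p) { ++process->refs; }
  ~FakeReference() { --process->refs; }
  FakeReference(const FakeReference&) = delete;
  FakeReference& operator=(const FakeReference&) = delete;
  FakeProcess* operator->() const { return process; }

private:
  FakeProcess* process;
};

// Hypothetical event; its destructor runs whatever was bound into it.
struct FakeEvent {
  std::function<void()> onDestroy;
  ~FakeEvent() { if (onDestroy) onDestroy(); }
};

// Problematic ordering: the dropped event is destroyed while the
// reference is still held, so bound destructors run with refs == 1.
void deliverUnsafe(FakeProcess* p, std::unique_ptr<FakeEvent> event) {
  FakeReference ref(p);
  if (ref->terminating) {
    event.reset(); // Destructor runs here, inside the reference's scope.
  }
}

// Safer ordering: stash the dropped event, release the reference,
// and only then destroy the event.
void deliverSafe(FakeProcess* p, std::unique_ptr<FakeEvent> event) {
  std::unique_ptr<FakeEvent> dropped;
  {
    FakeReference ref(p);
    if (ref->terminating) {
      dropped = std::move(event);
    }
  } // Reference released here.
  dropped.reset(); // Destructor runs with no reference held.
}

// Returns the reference count observed by the event's destructor.
int refsSeenAtDestruction(bool safe) {
  FakeProcess process;
  int seen = -1;
  auto event = std::make_unique<FakeEvent>();
  event->onDestroy = [&] { seen = process.refs.load(); };
  if (safe) {
    deliverSafe(&process, std::move(event));
  } else {
    deliverUnsafe(&process, std::move(event));
  }
  return seen;
}
```

In this model the destructor only records the reference count. In the real 
scenario it could call something like terminate(), which would block on the 
mutex held by a cleanup that is itself waiting for that count to reach zero.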

I'm not sure how common the terminate case above is, but it's the most 
worrying. It probably makes sense to backport the fix to at least 1.8.x, and 
ideally further back.

I wrote a fix, and spent some time trying to test this but gave up after being 
unable to figure out how to reliably get into a deadlock state without races. 
The fix is here: https://reviews.apache.org/r/70778/

Can you let me know if it fixes the issue that you saw without your workaround?

> libprocess can deadlock on termination (cleanup() vs use() + terminate())
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9808
>                 URL: https://issues.apache.org/jira/browse/MESOS-9808
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>         Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt
>
>
> Using the process::loop() together with the common pattern of using 
> libprocess (Process wrapper + dispatching) is prone to causing a deadlock on 
> libprocess termination if the code does not wait for the loop exit before 
> termination.
> *The deadlock itself is not directly caused by the process::loop(), though.*
>  It occurs in the following setup with two processes (let's name them A and B).
> Thread 1 tries to clean up process A. It locks processes_mutex and hangs here:
>  
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079]
>  waiting for the process A to have no strong references.
> Thread 2 begins with creating a ProcessReference in 
> ProcessManager::deliver(UPID&) called for process: 
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799]
> and ends up waiting for processes_mutex in ProcessManager::terminate() for 
> process B:
>  
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155]
> -----------------
>  In the observed case, terminate() for process B was triggered by a 
> destructor of a process-wrapping object owned by a libprocess loop executing 
> on A.
> I'm attaching the stacks captured at the deadlock. Stacks of the threads 
> which lock one another are in [^deadlock_stacks_filtered.txt]. Note frame #1 
> in Thread 5 (waiting for all references to expire) and frames #48 and #8 in 
> Thread 19 (creating a reference and waiting for the processes_mutex).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
