-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70778/
-----------------------------------------------------------
Review request for mesos, Andrei Sekretenko and Chun-Hung Hsiao.
Bugs: MESOS-9808
https://issues.apache.org/jira/browse/MESOS-9808
Repository: mesos
Description
-------
Per MESOS-9808, when we deliver a DispatchEvent to a TERMINATING
Process, the DispatchEvent is destructed while holding a
TERMINATING process reference. This can lead to a deadlock
because the DispatchEvent destruction can run additional code
(e.g. destructors of objects that have been bound into the
dispatch) that ultimately invokes terminate (or dispatch, or send)
while the original TERMINATING process reference is still held. If
the TERMINATING process referenced is being cleaned up, we may
deadlock between (1) `ProcessManager::cleanup()` holding the
`processes_mutex` and spinning for all references to go away and
(2) the terminate (or dispatch, or send) in ProcessManager::use
trying to resolve the destination Process by locking
`processes_mutex`.
Example from MESOS-9808:
Thread 19:
Running ReaderProcess
ReaderProcess::_consume, sets a Promise
Dispatches to mock API subscriber Process. **It's TERMINATING,
so DispatchEvent gets deleted while holding a reference to
the mock api subscriber!**
Destructs the process::Loop in mock API subscriber Process
Deletes the Reader, Reader terminates ReaderProcess
**Trying to lock `processes_mutex` in `ProcessManager::use()`
of ReaderProcess (deadlock)**
Thread 5:
ProcessManager::cleanup of Mock API Subscriber Process,
looping on references to go away (spinning forever)
Here, thread 5 is holding `processes_mutex` locked and spinning
waiting for references to the Mock API Subscriber Process to go
away, but thread 19 is blocked also trying to lock `processes_mutex`
while holding a reference to the Mock API Subscriber Process!
The fix here is to not allow arbitrary code to execute while a
ProcessReference is held. This is already done in other locations
but this spot was missed. The reason we haven't hit this bug yet
is that it's rather uncommon for a dispatch to have bound
objects whose destructors lead to a `terminate()` call (today
`terminate()` always resolves the process reference, even if
it's already resolved, whereas `dispatch()` uses cached process
references).
Diffs
-----
3rdparty/libprocess/include/process/process.hpp
7c255acc21c695eba37062a3dcf72ce33f650cd0
3rdparty/libprocess/src/process.cpp a8d9151c91892b55d31cc7c3fb4c7243c41d9d20
Diff: https://reviews.apache.org/r/70778/diff/1/
Testing
-------
Thanks,
Benjamin Mahler