----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70778/#review215721 -----------------------------------------------------------
3rdparty/libprocess/include/process/process.hpp Lines 420-421 (original), 420-423 (patched) <https://reviews.apache.org/r/70778/#comment302534> In order to help prevent issues in the future, consider expanding this comment to note that for an interface which enqueues events into a process's queue and takes ownership of the event in all cases, callers can go through the process manager via `ProcessManager::deliver()`. (Although the `deliver()` interface is a bit ambiguous, since we have two overloads, one of which deletes un-enqueued events, the other of which does not. 3rdparty/libprocess/src/process.cpp Lines 377-380 (original), 377-380 (patched) <https://reviews.apache.org/r/70778/#comment302533> Similar to the note on `ProcessBase::enqueue()` saying that callers retain ownership of un-enqueued events, could you add a comment here noting that the process manager takes ownership of the event passed into `deliver()`? 3rdparty/libprocess/src/process.cpp Line 1594 (original), 1594 (patched) <https://reviews.apache.org/r/70778/#comment302535> Does this change have the desired effect? Since it's calling into the `deliver()` overload which accepts a `ProcessBase*`, the event won't be deleted if it's not successfully enqueued? - Greg Mann On June 4, 2019, 8:09 p.m., Benjamin Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70778/ > ----------------------------------------------------------- > > (Updated June 4, 2019, 8:09 p.m.) > > > Review request for mesos, Andrei Sekretenko and Chun-Hung Hsiao. > > > Bugs: MESOS-9808 > https://issues.apache.org/jira/browse/MESOS-9808 > > > Repository: mesos > > > Description > ------- > > Per MESOS-9808, when we deliver a DispatchEvent to a TERMINATING > Process, the DispatchEvent is destructed while holding a > TERMINATING process reference. This can lead to a deadlock > because the DispatchEvent destruction can run additional code > (e.g. destructors of objects that have been bound into the > dispatch) that ultimately invokes terminate (or dispatch, or send) > while the original TERMINATING process reference is still held. If > the TERMINATING process referenced is being cleaned up, we may > deadlock between (1) `ProcessManager::cleanup()` holding the > `processes_mutex` and spinning for all references to go away and > (2) the terminate (or dispatch, or send) in ProcessManager::use > trying to resolve the destination Process by locking > `processes_mutex`. > > Example from MESOS-9808: > > Thread 19: > Running ReaderProcess > ReaderProcess::_consume, sets a Promise > Dispatches to mock API subscriber Process. **It's TERMINATING, > so DispatchEvent gets deleted while holding a reference to > the mock api subscriber!** > Destructs the process::Loop in mock API subscriber Process > Deletes the Reader, Reader terminates ReaderProcess > **Trying to lock `processes_mutex` in `ProcessManager::use()` > of ReaderProcess (deadlock)** > > Thread 5: > ProcessManager::cleanup of Mock API Subscriber Process, > looping on references to go away (spinning forever) > > Here, thread 5 is holding `processes_mutex` locked and spinning > waiting for references to the Mock API Subscriber Process to go > away, but thread 19 is blocked also trying to lock `processes_mutex` > while holding a reference to the Mock API Subscriber Process! > > The fix here is to not allow arbitrary code to execute while a > ProcessReference is held. This is already done in other locations > but this spot was missed. The reason we haven't hit this bug yet > is that it's rather uncommon for a dispatch to have bound > objects whose destructors lead to a `terminate()` call (today > `terminate()` always resolves the process reference, even if > it's already resolved, whereas `dispatch()` uses cached process > references). > > > Diffs > ----- > > 3rdparty/libprocess/include/process/process.hpp > 7c255acc21c695eba37062a3dcf72ce33f650cd0 > 3rdparty/libprocess/src/event_queue.hpp > 999d5520dc5824e47ba5cf676cf28cb6ca37ceb2 > 3rdparty/libprocess/src/process.cpp > a8d9151c91892b55d31cc7c3fb4c7243c41d9d20 > > > Diff: https://reviews.apache.org/r/70778/diff/2/ > > > Testing > ------- > > > Thanks, > > Benjamin Mahler > >
