> On Oct. 22, 2017, 9:09 a.m., Benjamin Hindman wrote: > > 3rdparty/libprocess/src/process.cpp > > Lines 3291 (patched) > > <https://reviews.apache.org/r/63208/diff/1/?file=1865231#file1865231line3291> > > > > Any way we could not have to grab this reference each time we loop? > > This is in the hot path.
Yeah, I'll do that, it's also not a complete fix I realized because there's an access to `process->manage` below the loop! - Benjamin ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63208/#review188906 ----------------------------------------------------------- On Oct. 22, 2017, 3 a.m., Benjamin Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63208/ > ----------------------------------------------------------- > > (Updated Oct. 22, 2017, 3 a.m.) > > > Review request for mesos and Benjamin Hindman. > > > Bugs: MESOS-7921 > https://issues.apache.org/jira/browse/MESOS-7921 > > > Repository: mesos > > > Description > ------- > > When ProcessManager::resume moves the process into a BLOCKED > state, it's possible that a TerminateEvent is enqueued and the > process is placed back in the run queue. Another worker thread > can pick it off the run queue, process the TerminateEvent, and > delete the process! At this point, when the original thread > in ProcessManager::resume tries to check if any events were > enqueued before transitioning to BLOCKED, it will access a > deleted process and crash. > > Some example crash paths involving process::Latch below. > > Path 1: > > T1 creates latch, spawns latch process, and puts it in run queue > T1 waits on latch > T1 ProcessManager::wait on latch see it in BOTTOM > > T2 worker thread dequeues the latch process > T2 ProcessManager::resume runs initialize, moves it to READY > T2 ProcessManager::resume sees empty queue > T2 ProcessManager::resume sets to BLOCKED > > T3 triggers the latch, terminates the latch process > T3 enqueue TerminateEvent > T3 enqueue sees state BLOCKED > T3 swaps from BLOCKED->READY & enqueues latch process into run queue > > T1 extracts latch process from run queue > T1 ProcessManager::resume sees READY > T1 ProcessManager::resume dequeues terminate event > T1 ProcessManager::resume calls ProcessManager::cleanup > T1 ProcessManager::cleanup sets to TERMINATING > T1 ProcessManager::cleanup decommissions event queue > T1 ProcessManager::cleanup waits for latch refs to go away > T1 ProcessManager::cleanup calls SocketManager::exited > T1 ProcessManager::cleanup opens gate > T1 ProcessManager::cleanup deletes the latch process > > T2 ProcessManager::resume checks if event queue is empty again (crash) > > Path 2: > > T1 creates latch, spawns latch process, and puts it in run queue > T1 waits on latch > T1 ProcessManager::wait on latch see it in BOTTOM > T1 ProcessManager::wait extracts latch process from run queue > T1 ProcessManager::resume runs initialize, moves it to READY > T1 ProcessManager::resume sees empty queue > T1 ProcessManager::resume sets to BLOCKED > > T3 triggers the latch, terminates the latch process > T3 enqueue TerminateEvent > T3 enqueue sees state BLOCKED > T3 swaps from BLOCKED->READY & enqueues latch process into run queue > > T2 worker thread dequeues the latch process > T2 ProcessManager::resume sees READY > T2 ProcessManager::resume dequeues terminate event > T2 ProcessManager::resume calls ProcessManager::cleanup > T2 ProcessManager::cleanup sets to TERMINATING > T2 ProcessManager::cleanup decommissions event queue > T2 ProcessManager::cleanup waits for latch refs to go away > T2 ProcessManager::cleanup calls SocketManager::exited > T2 ProcessManager::cleanup opens gate > T2 ProcessManager::resume deletes the latch process > > T1 ProcessManager::resume checks if event queue is empty again (crash) > > > Diffs > ----- > > 3rdparty/libprocess/src/process.cpp > 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e > > > Diff: https://reviews.apache.org/r/63208/diff/1/ > > > Testing > ------- > > * Tested by injecting a sleep > [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278). > * Ran throughput benchmark, results are highly variable across runs, see > results > [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing). > > > Thanks, > > Benjamin Mahler > >
