----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63208/ -----------------------------------------------------------
Review request for mesos and Benjamin Hindman. Bugs: MESOS-7921 https://issues.apache.org/jira/browse/MESOS-7921 Repository: mesos Description ------- When ProcessManager::resume moves the process into a BLOCKED state, it's possible that a TerminateEvent is enqueued and the process is placed back in the run queue. Another worker thread can pick it off the run queue, process the TerminateEvent, and delete the process! At this point, when the original thread in ProcessManager::resume tries to check if any events were enqueued before transitioning to BLOCKED, it will access a deleted process and crash. Some example crash paths involving process::Latch below. Path 1: T1 creates latch, spawns latch process, and puts it in run queue T1 waits on latch T1 ProcessManager::wait on latch see it in BOTTOM T2 worker thread dequeues the latch process T2 ProcessManager::resume runs initialize, moves it to READY T2 ProcessManager::resume sees empty queue T2 ProcessManager::resume sets to BLOCKED T3 triggers the latch, terminates the latch process T3 enqueue TerminateEvent T3 enqueue sees state BLOCKED T3 swaps from BLOCKED->READY & enqueues latch process into run queue T1 extracts latch process from run queue T1 ProcessManager::resume sees READY T1 ProcessManager::resume dequeues terminate event T1 ProcessManager::resume calls ProcessManager::cleanup T1 ProcessManager::cleanup sets to TERMINATING T1 ProcessManager::cleanup decommissions event queue T1 ProcessManager::cleanup waits for latch refs to go away T1 ProcessManager::cleanup calls SocketManager::exited T1 ProcessManager::cleanup opens gate T1 ProcessManager::cleanup deletes the latch process T2 ProcessManager::resume checks if event queue is empty again (crash) Path 2: T1 creates latch, spawns latch process, and puts it in run queue T1 waits on latch T1 ProcessManager::wait on latch see it in BOTTOM T1 ProcessManager::wait extracts latch process from run queue T1 ProcessManager::resume runs initialize, moves it to READY T1 ProcessManager::resume sees empty queue T1 ProcessManager::resume sets to BLOCKED T3 triggers the latch, terminates the latch process T3 enqueue TerminateEvent T3 enqueue sees state BLOCKED T3 swaps from BLOCKED->READY & enqueues latch process into run queue T2 worker thread dequeues the latch process T2 ProcessManager::resume sees READY T2 ProcessManager::resume dequeues terminate event T2 ProcessManager::resume calls ProcessManager::cleanup T2 ProcessManager::cleanup sets to TERMINATING T2 ProcessManager::cleanup decommissions event queue T2 ProcessManager::cleanup waits for latch refs to go away T2 ProcessManager::cleanup calls SocketManager::exited T2 ProcessManager::cleanup opens gate T2 ProcessManager::resume deletes the latch process T1 ProcessManager::resume checks if event queue is empty again (crash) Diffs ----- 3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e Diff: https://reviews.apache.org/r/63208/diff/1/ Testing ------- * Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278). * Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing). Thanks, Benjamin Mahler