-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/
-----------------------------------------------------------

Review request for mesos and Benjamin Hindman.


Bugs: MESOS-7921
    https://issues.apache.org/jira/browse/MESOS-7921


Repository: mesos


Description
-------

When ProcessManager::resume moves the process into a BLOCKED
state, it's possible that a TerminateEvent is enqueued and the
process is placed back in the run queue. Another worker thread
can pick it off the run queue, process the TerminateEvent, and
delete the process! At this point, when the original thread
in ProcessManager::resume tries to check if any events were
enqueued before transitioning to BLOCKED, it will access a
deleted process and crash.

Some example crash paths involving process::Latch below.

Path 1:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM

T2 worker thread dequeues the latch process
T2 ProcessManager::resume runs initialize, moves it to READY
T2 ProcessManager::resume sees empty queue
T2 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T1 extracts latch process from run queue
T1 ProcessManager::resume sees READY
T1 ProcessManager::resume dequeues terminate event
T1 ProcessManager::resume calls ProcessManager::cleanup
T1 ProcessManager::cleanup sets to TERMINATING
T1 ProcessManager::cleanup decommissions event queue
T1 ProcessManager::cleanup waits for latch refs to go away
T1 ProcessManager::cleanup calls SocketManager::exited
T1 ProcessManager::cleanup opens gate
T1 ProcessManager::cleanup deletes the latch process

T2 ProcessManager::resume checks if event queue is empty again (crash)

Path 2:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM
T1 ProcessManager::wait extracts latch process from run queue
T1 ProcessManager::resume runs initialize, moves it to READY
T1 ProcessManager::resume sees empty queue
T1 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T2 worker thread dequeues the latch process
T2 ProcessManager::resume sees READY
T2 ProcessManager::resume dequeues terminate event
T2 ProcessManager::resume calls ProcessManager::cleanup
T2 ProcessManager::cleanup sets to TERMINATING
T2 ProcessManager::cleanup decommissions event queue
T2 ProcessManager::cleanup waits for latch refs to go away
T2 ProcessManager::cleanup calls SocketManager::exited
T2 ProcessManager::cleanup opens gate
T2 ProcessManager::resume deletes the latch process

T1 ProcessManager::resume checks if event queue is empty again (crash)


Diffs
-----

  3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 


Diff: https://reviews.apache.org/r/63208/diff/1/


Testing
-------

* Tested by injecting a sleep 
[here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
* Ran throughput benchmark, results are highly variable across runs, see 
results 
[here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).


Thanks,

Benjamin Mahler

Reply via email to