[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151526#comment-16151526
 ] 

Benjamin Hindman commented on MESOS-7921:
-----------------------------------------

[~xujyan]: we shouldn't be able to resume if the thread is terminating or 
cleaned up because it won't be in the run queue.

I did track down at least one bug with how we do garbage collection that could 
lead to these crashes. Basically, we could have the following sequence of 
events:

(1) Spawn process Foo, which enqueues a dispatch to `GarbageCollector::manage()`.
(2) Terminate process Foo.
(3) Execute `GarbageCollector::manage()`, which calls `link()` and sees that 
the process is terminated, but BEFORE it enqueues an exited event, another 
thread spawns another process Foo, which enqueues a dispatch to 
`GarbageCollector::manage()`. See 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3514
 for where a thread could get stalled in order to cause this unfortunate 
ordering. 
(4) Execute `GarbageCollector::manage()` _which overwrites the process that 
we're managing_.
(5) Execute `GarbageCollector::exited()`, _which deletes the new process that 
has been spawned, not the old process that has terminated_.
(6) Crash because we're using deleted memory.
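
The ordering above can be replayed deterministically in a toy model. This is 
only a sketch, assuming the collector tracks a single pointer per process ID; 
`FakeProcess`, `ToyCollector` and `replayRace` are hypothetical names made up 
for illustration and are not the libprocess implementation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a libprocess process (not the real ProcessBase).
struct FakeProcess {
  FakeProcess(std::string id_, bool* deleted_)
    : id(std::move(id_)), deleted(deleted_) {}

  // Record the deletion so the replay can observe which object was freed.
  ~FakeProcess() { *deleted = true; }

  std::string id;
  bool* deleted;
};

// Toy collector, assumed to keep one pointer per process ID.
class ToyCollector {
public:
  // Starts managing `process`; silently overwrites any entry with the same ID.
  void manage(FakeProcess* process) { managed[process->id] = process; }

  // Handles an exited event: deletes whatever is currently stored for `id`.
  void exited(const std::string& id) {
    auto it = managed.find(id);
    if (it == managed.end()) {
      return;
    }
    FakeProcess* process = it->second;
    managed.erase(it);
    delete process;
  }

private:
  std::map<std::string, FakeProcess*> managed;
};

// Replays steps (3) to (5) on a single thread. Returns true if the newly
// spawned process was deleted while the terminated one was left untouched.
bool replayRace() {
  bool oldDeleted = false;
  bool newDeleted = false;
  ToyCollector gc;

  FakeProcess* oldFoo = new FakeProcess("Foo", &oldDeleted);
  gc.manage(oldFoo);  // (3) manage() for the old, already terminated Foo.

  FakeProcess* newFoo = new FakeProcess("Foo", &newDeleted);
  gc.manage(newFoo);  // (4) manage() for the new Foo overwrites the entry.

  assert(!oldDeleted && !newDeleted);

  gc.exited("Foo");   // (5) The exited event for the old Foo is handled.

  bool wrongOneDeleted = newDeleted && !oldDeleted;

  delete oldFoo;      // Cleanup for the sketch only; newFoo is already gone.
  return wrongOneDeleted;
}
```

In this model `replayRace()` returns true: the exited event belonging to the 
old Foo frees the new Foo, so any later use of the new process would touch 
deleted memory, as in step (6).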

Here is a patch that removes and simplifies garbage collection: 
https://reviews.apache.org/r/62053

Unfortunately I still can't trigger this bug myself. [~xujyan], can you try 
applying this patch and tell me if you still run into the crash? Thank you!

> process::EventQueue sometimes crashes
> -------------------------------------
>
>                 Key: MESOS-7921
>                 URL: https://issues.apache.org/jira/browse/MESOS-7921
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 1.4.0
>         Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>            Reporter: Yan Xu
>            Priority: Blocker
>         Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @     0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
>     @     0x2b9e29d26330 (unknown)
>     @     0x2b9e2581caa0 process::EventQueue::Consumer::empty()
>     @     0x2b9e25800a40 process::ProcessManager::resume()
>     @     0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
>     @     0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
>     @     0x2b9e2580f77c std::thread::_Impl<>::_M_run()
>     @     0x2b9e29fe5a60 (unknown)
>     @     0x2b9e29d1e184 start_thread
>     @     0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A [email protected] query shows many such instances: 
> https://lists.apache.org/[email protected]:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
