[
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151526#comment-16151526
]
Benjamin Hindman commented on MESOS-7921:
-----------------------------------------
[~xujyan]: we shouldn't be able to resume if the thread is terminating or
cleaned up because it won't be in the run queue.
I did track down at least one bug with how we do garbage collection that could
lead to these crashes. Basically, we could have the following sequence of
events:
(1) spawn process Foo, this enqueues a dispatch to `GarbageCollector::manage()`.
(2) terminate process Foo
(3) Execute `GarbageCollector::manage()` which calls `link()` and sees that
the process is terminated, but BEFORE enqueuing an exited event, another thread
spawns another process Foo which enqueues a dispatch to
`GarbageCollector::manage()`. See
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3514
for where a thread could get stalled in order to cause this unfortunate
ordering.
(4) Execute `GarbageCollector::manage()` _which overwrites the process that
we're managing_.
(5) Execute `GarbageCollector::exited()` _which deletes the new process that
has been spawned not the old process that has terminated_.
(6) Crash because we're using deleted memory.
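To make the ordering concrete, here is a minimal single-threaded sketch that
replays steps (1) through (5) against a toy collector. This is not the
libprocess code: `FakeProcess`, `ToyGarbageCollector`, `Outcome` and
`replaySequence()` are hypothetical names, and keying the bookkeeping by the
process id string is an assumption made for illustration. Deletion is recorded
with a flag rather than a real `delete` so the demo itself stays well-defined.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a process; it only tracks state.
struct FakeProcess {
  explicit FakeProcess(std::string id_) : id(std::move(id_)) {}

  std::string id;
  bool terminated = false;
  bool deleted = false;  // Flag instead of a real delete.
};

// Toy collector (assumption: bookkeeping is keyed by process id).
struct ToyGarbageCollector {
  std::map<std::string, FakeProcess*> managed;

  // Registers a process; silently overwrites an entry with the same id.
  void manage(FakeProcess* process) { managed[process->id] = process; }

  // Handles an exited event by "deleting" whatever is registered for the id.
  void exited(const std::string& id) {
    auto it = managed.find(id);
    if (it == managed.end()) {
      return;
    }
    it->second->deleted = true;
    managed.erase(it);
  }
};

struct Outcome {
  bool oldDeleted;
  bool newDeleted;
};

// Replays the problematic ordering deterministically.
Outcome replaySequence() {
  FakeProcess oldFoo("Foo");
  FakeProcess newFoo("Foo");
  ToyGarbageCollector gc;

  // (1) Spawn old Foo: a manage() dispatch is enqueued but has not run yet.
  // (2) Terminate old Foo.
  oldFoo.terminated = true;

  // (3) The first manage() runs and sees the process is terminated, but the
  //     exited event is delayed; meanwhile a new Foo is spawned, which
  //     enqueues a second manage() dispatch.
  gc.manage(&oldFoo);

  // (4) The second manage() overwrites the entry for "Foo".
  gc.manage(&newFoo);
  assert(gc.managed.size() == 1);

  // (5) The delayed exited event for the old Foo is finally handled.
  gc.exited("Foo");

  return Outcome{oldFoo.deleted, newFoo.deleted};
}
```

Calling `replaySequence()` reports the old process as not deleted and the new,
still-live process as deleted, which corresponds to step (6) once anything
touches the new process.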
Here is a patch that removes and simplifies garbage collection:
https://reviews.apache.org/r/62053
Unfortunately, I still can't trigger this bug myself. [~xujyan], can you try
applying this patch and tell me if you still run into the crash? Thank you!
> process::EventQueue sometimes crashes
> -------------------------------------
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details:
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
> Reporter: Yan Xu
> Priority: Blocker
> Attachments:
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt,
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
> in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky
> and shows up in other tests and environments (with or without
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A [email protected] query shows many such instances:
> https://lists.apache.org/[email protected]:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty