Benjamin Mahler created MESOS-457:
-------------------------------------
Summary: Killing the slave while forked can cause the forked slave
to deadlock.
Key: MESOS-457
URL: https://issues.apache.org/jira/browse/MESOS-457
Project: Mesos
Issue Type: Bug
Reporter: Benjamin Mahler
This is related to MESOS-393.
It was discovered on a CentOS Linux machine in production.
A kill was issued to the slave while it was forked to launch an executor; the
child slave process then remained running, deadlocked at the following location:
$ ps aux | grep mesos-slave
bmahler 13626 0.0 0.0 61224 784 pts/1 S+ 21:28 0:00 grep mesos-slave
root 48629 0.0 2.1 1156480 535644 ? S Apr29 0:00 /usr/local/sbin/mesos-slave --port=5051
$ gdb -p 48629
(gdb) where
#0 0x00007f7612e484c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f7612e43e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00007f7612e43cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f7611d8a990 in std::locale::locale() () from /usr/lib64/libstdc++.so.6
#4 0x00007f7613694a4d in basic_ostringstream (this=0x80) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/basic_ios.h:446
#5 process::UPID::operator std::string (this=0x80) at ../../../third_party/libprocess/src/pid.cpp:59
#6 0x00007f761359febe in mesos::internal::slave::CgroupsIsolator::launchExecutor (this=0x7f7600005690, slaveId=..., frameworkId=..., frameworkInfo=..., executorInfo=..., uuid=..., directory=..., resources=...) at ../../src/slave/cgroups_isolator.cpp:578
#7 0x00007f76134a2582 in operator()<mesos::internal::slave::Isolator*> (__functor=<value optimized out>, __a1=0x7f7600005690) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/tr1/functional_iterate.h:214
#8 std::tr1::_Function_handler<void ()(mesos::internal::slave::Isolator*),std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Isolator::*)(const mesos::SlaveID&, const mesos::FrameworkID&, const mesos::FrameworkInfo&, const mesos::ExecutorInfo&, const UUID&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const mesos::internal::Resources&)> ()(std::tr1::_Placeholder<1>, mesos::SlaveID, mesos::FrameworkID, mesos::FrameworkInfo, mesos::ExecutorInfo, UUID, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, mesos::internal::Resources)> >::_M_invoke(const std::tr1::_Any_data &, mesos::internal::slave::Isolator *) (__functor=<value optimized out>, __a1=0x7f7600005690) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/tr1/functional_iterate.h:502
#9 0x00007f76134ab4b4 in operator()<process::ProcessBase*> (__functor=<value optimized out>, __a1=0x7f76000058b0) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/tr1/bind_iterate.h:45
#10 std::tr1::_Function_handler<void ()(process::ProcessBase*),std::tr1::_Bind<void (* ()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Isolator*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Isolator*)> >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *) (__functor=<value optimized out>, __a1=0x7f76000058b0) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/tr1/functional_iterate.h:502
#11 0x00007f76136a799a in process::ProcessManager::resume (this=0x19cc6c0, process=0x7f76000058b0) at ../../../third_party/libprocess/src/process.cpp:2432
#12 0x00007f76136a89af in process::schedule (arg=<value optimized out>) at ../../../third_party/libprocess/src/process.cpp:1167
#13 0x00007f7612e4173d in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7611825f6d in clone () from /lib64/libc.so.6
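
Frames #0 through #5 show the child blocked in pthread_mutex_lock() inside
std::locale::locale(), reached from an ostringstream constructed after the
fork. One plausible reading (an assumption, not confirmed in this report) is
the usual fork-in-a-multithreaded-process hazard: a mutex that another thread
held at fork() time is copied into the child in its locked state, and the
owning thread does not exist in the child to release it. A minimal sketch of
that hazard follows. All names in it (library_lock, holder,
child_sees_locked_mutex) are hypothetical and are not Mesos or libprocess
code; it assumes C++11 and POSIX threads.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <atomic>
#include <cassert>

// Hypothetical stand-in for a library-internal lock, such as the one
// std::locale::locale() is waiting on in frame #3.
static pthread_mutex_t library_lock = PTHREAD_MUTEX_INITIALIZER;
static std::atomic<bool> lock_held(false);
static std::atomic<bool> release_lock(false);

// A second thread that owns the lock at the moment the process forks.
static void* holder(void*) {
  pthread_mutex_lock(&library_lock);
  lock_held.store(true);
  while (!release_lock.load()) {
    usleep(1000);
  }
  pthread_mutex_unlock(&library_lock);
  return nullptr;
}

// Returns true if the forked child found the lock already taken.
bool child_sees_locked_mutex() {
  pthread_t thread;
  int rc = pthread_create(&thread, nullptr, holder, nullptr);
  assert(rc == 0);
  (void) rc;

  // Wait until the holder thread really owns the lock.
  while (!lock_held.load()) {
    usleep(1000);
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child: only the forking thread exists here, so nothing will ever
    // unlock library_lock. A plain pthread_mutex_lock() would block
    // forever, as in frames #0-#2. trylock is used only so that this demo
    // terminates; strictly speaking it is not async-signal-safe either.
    int locked = pthread_mutex_trylock(&library_lock);
    _exit(locked == 0 ? 0 : 1);
  }

  int status = 0;
  bool saw_locked = false;
  if (pid > 0 && waitpid(pid, &status, 0) == pid) {
    saw_locked = WIFEXITED(status) && WEXITSTATUS(status) == 1;
  }

  // Let the holder thread finish in the parent.
  release_lock.store(true);
  pthread_join(thread, nullptr);
  return saw_locked;
}
```

Calling child_sees_locked_mutex() from main() is enough to try it. If this
reading is right, anything the child does between fork() and exec() that may
take such a lock (building a std::string through an ostringstream, as in
frames #4 and #5) can hang the same way, so such values are usually computed
before the fork.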