Jie Yu created MESOS-1404:
-----------------------------
Summary: 'execute()' in mesos_containerizer.cpp is not async-signal-safe
Key: MESOS-1404
URL: https://issues.apache.org/jira/browse/MESOS-1404
Project: Mesos
Issue Type: Bug
Reporter: Jie Yu
This is because 'fork()' is not implemented as async-signal-safe in glibc,
although according to POSIX it should be. When the child tries to execute the
commands returned from the isolator's prepare(), it uses os::system, which
calls 'fork()'.
I observed this stack trace while debugging a deadlock:
{noformat}
(gdb) bt
#0 0x00007f8fb2d5d2ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f8fb2ce1d8e in _L_lock_44 () from /lib64/libc.so.6
#2 0x00007f8fb2cdab4c in ptmalloc_lock_all () from /lib64/libc.so.6
#3 0x00007f8fb2d11d65 in fork () from /lib64/libc.so.6
#4 0x00007f8fb4e898de in system (command=..., directory=<value optimized out>,
envp=..., uid=0, gid=0, redirectIO=<value optimized out>, pipeRead=29,
pipeWrite=30,
commands=std::list = {...}) at
../../../mesos/3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:558
#5 mesos::internal::slave::execute (command=..., directory=<value optimized
out>, envp=..., uid=0, gid=0, redirectIO=<value optimized out>, pipeRead=29,
pipeWrite=30,
commands=std::list = {...}) at
../../../mesos/src/slave/containerizer/mesos_containerizer.cpp:483
#6 0x00007f8fb4e97bab in __call<, 0, 1, 2, 3, 4, 5, 6, 7, 8> (__functor=<value
optimized out>)
at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1137
#7 operator()<> (__functor=<value optimized out>) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1191
#8 std::tr1::_Function_handler<int(), std::tr1::_Bind<int
(*(mesos::CommandInfo, std::basic_string<char, std::char_traits<char>,
std::allocator<char> >, os::ExecEnv, unsigned int, unsigned int, bool, int,
int, std::list<Option<mesos::CommandInfo>,
std::allocator<Option<mesos::CommandInfo> > >))(const mesos::CommandInfo&,
const std::string&, const os::ExecEnv&, uid_t, gid_t, bool, int, int, const
std::list<Option<mesos::CommandInfo>, std::allocator<Option<mesos::CommandInfo>
> >&)> >::_M_invoke(const std::tr1::_Any_data &) (__functor=<value optimized
out>) at
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1654
#9 0x00007f8fb4fcaebe in mesos::internal::slave::_childMain(const
std::tr1::function<int()> &, int *) (childFunction=..., pipes=0x7f8fad4f0040)
at ../../../mesos/src/slave/containerizer/linux_launcher.cpp:193
#10 0x00007f8fb2d4db6d in clone () from /lib64/libc.so.6
(gdb) info thread
* 1 Thread 0x7f8fad4f1700 (LWP 62980) 0x00007f8fb2d5d2ce in
__lll_lock_wait_private () from /lib64/libc.so.6
{noformat}
This stack trace matches the one discussed in the glibc issue tracker:
https://sourceware.org/bugzilla/show_bug.cgi?id=4737
That issue was marked as "WON'T FIX". Here is part of the discussion:
{noformat}
The Austin group met yesterday and retained the decision to interpret fork as
async-signal-unsafe with future specifications mandating that posix_spawn be
made async-signal-safe to fill the functionality gap. Minutes of the meeting
are available at https://www.opengroup.org/austin/docs/austin_446.txt.
I think this bug can now be closed as "WONTFIX"
{noformat}
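The discussion quoted above points to posix_spawn as the intended replacement for calling fork() in such contexts. As a rough illustration only, here is a minimal sketch of running a shell command through posix_spawn() instead of fork(). The helper name spawnShell is hypothetical and is not part of Mesos or stout. The sketch assumes /bin/sh exists, and it avoids heap allocation on purpose, since the deadlock in the trace is on the malloc lock. Whether posix_spawn() actually avoids the problem depends on the glibc version in use, which this issue does not establish.

```cpp
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cerrno>

extern char** environ;

// Hypothetical helper (not a Mesos or stout API): runs `command` via
// "/bin/sh -c" using posix_spawn() and returns the raw wait status from
// waitpid(), or -1 if spawning or waiting fails.
// No heap allocation is performed in this function.
int spawnShell(const char* command)
{
  // posix_spawn() takes 'char* const argv[]'; the strings are not modified.
  char* const argv[] = {
    const_cast<char*>("sh"),
    const_cast<char*>("-c"),
    const_cast<char*>(command),
    nullptr
  };

  pid_t pid;
  // posix_spawn() returns an error number directly (it does not set errno).
  int error = posix_spawn(&pid, "/bin/sh", nullptr, nullptr, argv, environ);
  if (error != 0) {
    return -1;
  }

  // Wait for the spawned shell, retrying if interrupted by a signal.
  int status = 0;
  while (waitpid(pid, &status, 0) == -1) {
    if (errno != EINTR) {
      return -1;
    }
  }

  return status;
}
```

A caller would inspect the result with WIFEXITED() and WEXITSTATUS(), for example spawnShell("exit 3") should report an exit status of 3. Changing uid/gid or the working directory, as 'execute()' does, is not covered by this sketch.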