Benjamin Hindman created MESOS-190:
--------------------------------------
Summary: Slave seg fault when executor exited
Key: MESOS-190
URL: https://issues.apache.org/jira/browse/MESOS-190
Project: Mesos
Issue Type: Bug
Reporter: Benjamin Hindman
Assignee: Vinod Kone
Priority: Blocker
When I restart, kill early, or otherwise interrupt my framework from the
client, the slave often segfaults. I'm not sure whether there is a bug in
my executor, but it seems Mesos should be more resilient than this.
Version: Mesos from Subversion, -r 1331158
I know optimized builds can be tricky to debug, but in this case it
does look like the slave was trying to dereference an invalid Task*
address (note that task matches %rdx, and the faulting instruction
dereferences %rdx).
Any suggestions?
(gdb) bt
#0 mesos::internal::slave::Slave::executorExited (this=0x1305820,
frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
#1 0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
#2 operator()<process::ProcessBase*> (this=<optimized out>)
at /usr/include/c++/4.6/tr1/functional:1207
#3 std::tr1::_Function_handler<void (process::ProcessBase*),
std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
std::tr1::shared_ptr<std::tr1::function<void
(mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
std::tr1::shared_ptr<std::tr1::function<void
(mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
const&, process::ProcessBase*) (__functor=...,
__args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
#4 0x00007f0cf32014a3 in std::tr1::function<void
(process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
from /home/ubuntu/cr/lib/libmesos-0.9.0.so
#5 0x00007f0cf31f617f in
process::ProcessBase::visit(process::DispatchEvent const&) () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#6 0x00007f0cf31f885c in
process::DispatchEvent::visit(process::EventVisitor*) const () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#7 0x00007f0cf31f38cf in
process::ProcessManager::resume(process::ProcessBase*) () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#8 0x00007f0cf31ec783 in process::schedule(void*) ()
from /home/ubuntu/cr/lib/libmesos-0.9.0.so
#9 0x00007f0cf26e5e9a in start_thread ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) print task
$1 = (mesos::internal::Task *) 0x3031406576616c73
(gdb) info register
rax 0x7f0cf3647cf0 139693599784176
rbx 0x0 0
rcx 0x7f0ce8000038 139693408649272
rdx 0x3031406576616c73 3472627592201333875
rsi 0x2 2
rdi 0x7f0cf0613ac0 139693549238976
rbp 0x7f0ce80034c8 0x7f0ce80034c8
rsp 0x7f0cf0613c00 0x7f0cf0613c00
r8 0x7f0ce80009b0 139693408651696
r9 0x1 1
r10 0x6 6
r11 0x1 1
r12 0x7f0ce8001ca0 139693408656544
r13 0x7f0ce80056c0 139693408671424
r14 0x7f0ce8006cc0 139693408677056
r15 0x1305820 19945504
rip 0x7f0cf30fecd5 0x7f0cf30fecd5
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+533>
eflags 0x10206 [ PF IF RF ]
cs 0xe033 57395
ss 0xe02b 57387
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
disassemble:
0x00007f0cf30fecb9 <+505>: mov %rax,0x20(%rsp)
0x00007f0cf30fecbe <+510>: xor %ebx,%ebx
0x00007f0cf30fecc0 <+512>: cmp 0x20(%rsp),%r12
0x00007f0cf30fecc5 <+517>: je 0x7f0cf30fed2e
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+622>
0x00007f0cf30fecc7 <+519>: test %r12,%r12
0x00007f0cf30fecca <+522>: je 0x7f0cf30ff27d
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1981>
0x00007f0cf30fecd0 <+528>: mov 0x28(%r12),%rdx
=> 0x00007f0cf30fecd5 <+533>: mov 0x70(%rdx),%edi
0x00007f0cf30fecd8 <+536>: mov %rdx,0x8(%rsp)
0x00007f0cf30fecdd <+541>: callq 0x7f0cf3062220
<_ZN5mesos8internal5slave19isTerminalTaskStateENS_9TaskStateE@plt>
0x00007f0cf30fece2 <+546>: test %al,%al
0x00007f0cf30fece4 <+548>: mov 0x8(%rsp),%rdx
0x00007f0cf30fece9 <+553>: je 0x7f0cf30ff020
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1376>
0x00007f0cf30fecef <+559>: test %rbp,%rbp
0x00007f0cf30fecf2 <+562>: je 0x7f0cf30ff244
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1... (output truncated at the gdb pager prompt)
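One observation that may help narrow this down: the bogus Task* value in %rdx consists entirely of printable ASCII bytes when read in little-endian memory order. The Python sketch below decodes it and checks whether the address the faulting instruction reads (%rdx + 0x70) is even a canonical x86-64 address. The helper names are hypothetical, and the canonical check assumes ordinary 48-bit virtual addressing on the host.

```python
import struct


def pointer_as_ascii(value: int) -> str:
    """Show the 8 bytes a 64-bit value occupies in little-endian memory."""
    raw = struct.pack("<Q", value)
    # Replace non-printable bytes with '.' so the result is always readable.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def is_canonical_x86_64(address: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = address >> 47
    return top == 0 or top == 0x1FFFF


task = 0x3031406576616C73  # value of `task` / %rdx from the gdb session

print(pointer_as_ascii(task))            # slave@10
print(is_canonical_x86_64(task + 0x70))  # False
```

The decoded text resembles the start of a libprocess-style PID string such as slave@<ip>:<port>. That would be consistent with the Task object having been freed and its memory reused for a string, but this is only a guess from the bytes and not something the trace confirms.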
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira