Hi Scott,

Thanks for the report. I've been able to reproduce this and it is indeed a regression. I've filed https://issues.apache.org/jira/browse/MESOS-190, and hopefully we'll get a fix out the door ASAP.
Ben.

On Fri, May 4, 2012 at 5:11 PM, Scott Smith <[email protected]> wrote:
> When I restart/kill early or otherwise interrupt my framework from the
> client, I often segfault the slave. I'm not sure if there is a bug in
> my executor, but it seems Mesos should be more resilient than this.
>
> Mesos subversion -r 1331158
>
> I know optimized builds can be tricky to debug, but in this case it
> does look like it was trying to dereference the invalid Task* address
> (note that task matches %rdx, and the crashed assembly code is trying
> to dereference %rdx).
>
> Any suggestions?
>
> (gdb) bt
> #0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
>     frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> #1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
>     this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> #2  operator()<process::ProcessBase*> (this=<optimized out>)
>     at /usr/include/c++/4.6/tr1/functional:1207
> #3  std::tr1::_Function_handler<void (process::ProcessBase*),
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> const&, process::ProcessBase*) (__functor=...,
>     __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> #4  0x00007f0cf32014a3 in std::tr1::function<void
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
>    from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #5  0x00007f0cf31f617f in
> process::ProcessBase::visit(process::DispatchEvent const&) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #6  0x00007f0cf31f885c in
> process::DispatchEvent::visit(process::EventVisitor*) const () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #7  0x00007f0cf31f38cf in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #8  0x00007f0cf31ec783 in process::schedule(void*) ()
>    from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #9  0x00007f0cf26e5e9a in start_thread ()
>    from /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
> (gdb) print task
> $1 = (mesos::internal::Task *) 0x3031406576616c73
> (gdb) info register
> rax            0x7f0cf3647cf0      139693599784176
> rbx            0x0                 0
> rcx            0x7f0ce8000038      139693408649272
> rdx            0x3031406576616c73  3472627592201333875
> rsi            0x2                 2
> rdi            0x7f0cf0613ac0      139693549238976
> rbp            0x7f0ce80034c8      0x7f0ce80034c8
> rsp            0x7f0cf0613c00      0x7f0cf0613c00
> r8             0x7f0ce80009b0      139693408651696
> r9             0x1                 1
> r10            0x6                 6
> r11            0x1                 1
> r12            0x7f0ce8001ca0      139693408656544
> r13            0x7f0ce80056c0      139693408671424
> r14            0x7f0ce8006cc0      139693408677056
> r15            0x1305820           19945504
> rip            0x7f0cf30fecd5      0x7f0cf30fecd5
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+533>
> eflags         0x10206             [ PF IF RF ]
> cs             0xe033              57395
> ss             0xe02b              57387
> ds             0x0                 0
> es             0x0                 0
> fs             0x0                 0
> gs             0x0                 0
>
> disassemble:
>
>    0x00007f0cf30fecb9 <+505>:   mov    %rax,0x20(%rsp)
>    0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
>    0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
>    0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+622>
>    0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
>    0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1981>
>    0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
> => 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
>    0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
>    0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
> <_ZN5mesos8internal5slave19isTerminalTaskStateENS_9TaskStateE@plt>
>    0x00007f0cf30fece2 <+546>:   test   %al,%al
>    0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
>    0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1376>
>    0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
>    0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> or q <re
>
> --
> Scott
>
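A side note on the bad pointer in the quoted gdb session: the value of task (and %rdx), 0x3031406576616c73, is made up entirely of printable ASCII bytes. x86-64 is little-endian, so the bytes can be read back in memory order to see what text they spell. The snippet below is only a rough sketch for decoding such a value; the helper name pointer_bytes_as_text is made up for illustration and is not part of Mesos or gdb.

```python
import struct


def pointer_bytes_as_text(value: int) -> str:
    """Render a 64-bit value as the 8 bytes it occupies in little-endian
    memory, showing printable ASCII as-is and everything else as '.'.
    (Hypothetical helper, not part of any Mesos or gdb tooling.)"""
    raw = struct.pack("<Q", value)  # "<Q" = little-endian unsigned 64-bit
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Value of %rdx / task taken from the quoted gdb output.
rdx = 0x3031406576616c73
print(pointer_bytes_as_text(rdx))  # prints: slave@10
```

A pointer that decodes to readable text usually means the memory it was loaded from held string data rather than a real address. One possible reading, which is an assumption and not something this thread confirms, is that the Task* slot was overwritten by a string or belongs to freed memory that was later reused for one.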
