This is really easy to reproduce (all commands from within the build directory):

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050

Wait a few seconds for the framework to launch its jobs, then kill the
long-lived-framework, and the slave should crash. You can go back and run
the slave via ./bin/gdb-mesos-slave.sh and it will give you the stack trace
that Scott included in his email.

On Mon, May 7, 2012 at 10:44 AM, Vinod Kone <[email protected]> wrote:

> Hi Ben/Scott,
>
> Can you provide the slave log of the repro?
>
> thanx,
> @vinodkone
>
>
> On Mon, May 7, 2012 at 10:00 AM, Benjamin Hindman <[email protected]> wrote:
>
> > Hi Scott,
> >
> > Thanks for the report. I've been able to reproduce this and it is indeed a
> > regression. I've filed https://issues.apache.org/jira/browse/MESOS-190, and
> > hopefully we'll get a fix out the door ASAP.
> >
> > Ben.
> >
> >
> > On Fri, May 4, 2012 at 5:11 PM, Scott Smith <[email protected]> wrote:
> >
> > > When I restart/kill early or otherwise interrupt my framework from the
> > > client, I often segfault the slave. I'm not sure if there is a bug in
> > > my executor, but it seems Mesos should be more resilient than this.
> > >
> > > Mesos subversion -r 1331158
> > >
> > > I know optimized builds can be tricky to debug, but in this case it
> > > does look like it was trying to dereference the invalid Task* address
> > > (note that task matches %rdx, and the crashed assembly code is trying
> > > to dereference %rdx).
> > >
> > > Any suggestions?
> > >
> > > (gdb) bt
> > > #0 mesos::internal::slave::Slave::executorExited (this=0x1305820,
> > > frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> > > #1 0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
> > > this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> > > #2 operator()<process::ProcessBase*> (this=<optimized out>)
> > > at /usr/include/c++/4.6/tr1/functional:1207
> > > #3 std::tr1::_Function_handler<void (process::ProcessBase*),
> > > std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> > > std::tr1::shared_ptr<std::tr1::function<void
> > > (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> > > std::tr1::shared_ptr<std::tr1::function<void
> > > (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> > > const&, process::ProcessBase*) (__functor=...,
> > > __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> > > #4 0x00007f0cf32014a3 in std::tr1::function<void
> > > (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
> > > from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #5 0x00007f0cf31f617f in
> > > process::ProcessBase::visit(process::DispatchEvent const&) () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #6 0x00007f0cf31f885c in
> > > process::DispatchEvent::visit(process::EventVisitor*) const () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #7 0x00007f0cf31f38cf in
> > > process::ProcessManager::resume(process::ProcessBase*) () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #8 0x00007f0cf31ec783 in process::schedule(void*) ()
> > > from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #9 0x00007f0cf26e5e9a in start_thread ()
> > > from /lib/x86_64-linux-gnu/libpthread.so.0
> > > #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #11 0x0000000000000000 in ?? ()
> > > (gdb) print task
> > > $1 = (mesos::internal::Task *) 0x3031406576616c73
> > > (gdb) info register
> > > rax            0x7f0cf3647cf0      139693599784176
> > > rbx            0x0                 0
> > > rcx            0x7f0ce8000038      139693408649272
> > > rdx            0x3031406576616c73  3472627592201333875
> > > rsi            0x2                 2
> > > rdi            0x7f0cf0613ac0      139693549238976
> > > rbp            0x7f0ce80034c8      0x7f0ce80034c8
> > > rsp            0x7f0cf0613c00      0x7f0cf0613c00
> > > r8             0x7f0ce80009b0      139693408651696
> > > r9             0x1                 1
> > > r10            0x6                 6
> > > r11            0x1                 1
> > > r12            0x7f0ce8001ca0      139693408656544
> > > r13            0x7f0ce80056c0      139693408671424
> > > r14            0x7f0ce8006cc0      139693408677056
> > > r15            0x1305820           19945504
> > > rip            0x7f0cf30fecd5      0x7f0cf30fecd5
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+533>
> > > eflags         0x10206             [ PF IF RF ]
> > > cs             0xe033              57395
> > > ss             0xe02b              57387
> > > ds             0x0                 0
> > > es             0x0                 0
> > > fs             0x0                 0
> > > gs             0x0                 0
> > >
> > > disassemble:
> > >
> > >    0x00007f0cf30fecb9 <+505>: mov %rax,0x20(%rsp)
> > >    0x00007f0cf30fecbe <+510>: xor %ebx,%ebx
> > >    0x00007f0cf30fecc0 <+512>: cmp 0x20(%rsp),%r12
> > >    0x00007f0cf30fecc5 <+517>: je 0x7f0cf30fed2e
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+622>
> > >    0x00007f0cf30fecc7 <+519>: test %r12,%r12
> > >    0x00007f0cf30fecca <+522>: je 0x7f0cf30ff27d
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1981>
> > >    0x00007f0cf30fecd0 <+528>: mov 0x28(%r12),%rdx
> > > => 0x00007f0cf30fecd5 <+533>: mov 0x70(%rdx),%edi
> > >    0x00007f0cf30fecd8 <+536>: mov %rdx,0x8(%rsp)
> > >    0x00007f0cf30fecdd <+541>: callq 0x7f0cf3062220
> > > <_ZN5mesos8internal5slave19isTerminalTaskStateENS_9TaskStateE@plt>
> > >    0x00007f0cf30fece2 <+546>: test %al,%al
> > >    0x00007f0cf30fece4 <+548>: mov 0x8(%rsp),%rdx
> > >    0x00007f0cf30fece9 <+553>: je 0x7f0cf30ff020
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1376>
> > >    0x00007f0cf30fecef <+559>: test %rbp,%rbp
> > >    0x00007f0cf30fecf2 <+562>: je 0x7f0cf30ff244
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> > > or q <re
> > >
> > > --
> > > Scott
> > >
> > >
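
A side note on the gdb session quoted above: the bad Task* value in %rdx,
0x3031406576616c73, consists entirely of printable ASCII bytes. Read in
little-endian order (as on x86-64), it looks like the start of a string
rather than a heap address. That would be consistent with the pointer having
been read from memory that was freed and reused for string data, but this is
only a guess from the register dump, not a confirmed diagnosis. The sketch
below shows the check; the helper name pointer_as_ascii is made up for
illustration and is not part of Mesos or gdb.

```python
import struct


def pointer_as_ascii(value: int) -> str:
    """Render a 64-bit value as its 8 little-endian bytes, ASCII where printable.

    Hypothetical helper for illustration only.
    """
    raw = struct.pack("<Q", value)  # 8 bytes, least-significant byte first
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Value of `task` (and %rdx) taken from the quoted gdb output.
task_ptr = 0x3031406576616C73
print(pointer_as_ascii(task_ptr))  # prints "slave@10"
```

If a pointer from a crash decodes to readable text like this, it is usually
worth checking for a use-after-free or a stale iterator before suspecting the
optimizer.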
