This is really easy to reproduce (all commands from within the build
directory):

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050

Wait a few seconds for the framework to launch its jobs, then kill the
long-lived-framework, and the slave should crash.

You can go back and run the slave via ./bin/gdb-mesos-slave.sh and it will
give you the stack trace that Scott included in his email.
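
One detail in that trace is worth a closer look: the bad Task* value
(0x3031406576616c73, also in %rdx) consists entirely of printable ASCII
bytes. Below is a minimal Python sketch that decodes it, assuming a
little-endian x86_64 layout; decode_pointer is a hypothetical helper name,
not part of Mesos or gdb.

```python
def decode_pointer(value: int, width: int = 8) -> str:
    """Interpret a pointer-sized integer as little-endian ASCII text."""
    raw = value.to_bytes(width, byteorder="little")
    # Show non-printable bytes as '.' so ordinary pointers read as noise.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Value of `task` (and %rdx) from the gdb session quoted below.
    print(decode_pointer(0x3031406576616c73))  # prints "slave@10"
```

A pointer that reads as text suggests, though it does not prove, that the
memory holding the Task* was freed and reused for string data before
executorExited dereferenced it.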



On Mon, May 7, 2012 at 10:44 AM, Vinod Kone <[email protected]> wrote:

> Hi Ben/Scott,
>
> Can you provide the slave log of the repro?
>
> thanx,
> @vinodkone
>
>
> On Mon, May 7, 2012 at 10:00 AM, Benjamin Hindman <[email protected]> wrote:
>
> > Hi Scott,
> >
> > Thanks for the report. I've been able to reproduce this and it is indeed a
> > regression. I've filed https://issues.apache.org/jira/browse/MESOS-190, and
> > hopefully we'll get a fix out the door ASAP.
> >
> > Ben.
> >
> >
> > On Fri, May 4, 2012 at 5:11 PM, Scott Smith <[email protected]> wrote:
> >
> > > When I restart/kill early or otherwise interrupt my framework from the
> > > client, I often segfault the slave.  I'm not sure if there is a bug in
> > > my executor, but it seems Mesos should be more resilient than this.
> > >
> > > Mesos subversion -r 1331158
> > >
> > > I know optimized builds can be tricky to debug, but in this case it
> > > does look like it was trying to dereference the invalid Task* address
> > > (note that task matches %rdx, and the crashed assembly code is trying
> > > to dereference %rdx).
> > >
> > > Any suggestions?
> > >
> > > (gdb) bt
> > > #0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
> > >    frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> > > #1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
> > >    this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> > > #2  operator()<process::ProcessBase*> (this=<optimized out>)
> > >    at /usr/include/c++/4.6/tr1/functional:1207
> > > #3  std::tr1::_Function_handler<void (process::ProcessBase*),
> > > std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> > > std::tr1::shared_ptr<std::tr1::function<void
> > > (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> > > std::tr1::shared_ptr<std::tr1::function<void
> > > (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> > > const&, process::ProcessBase*) (__functor=...,
> > >    __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> > > #4  0x00007f0cf32014a3 in std::tr1::function<void
> > > (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
> > >   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #5  0x00007f0cf31f617f in
> > > process::ProcessBase::visit(process::DispatchEvent const&) () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #6  0x00007f0cf31f885c in
> > > process::DispatchEvent::visit(process::EventVisitor*) const () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #7  0x00007f0cf31f38cf in
> > > process::ProcessManager::resume(process::ProcessBase*) () from
> > > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #8  0x00007f0cf31ec783 in process::schedule(void*) ()
> > >   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > > #9  0x00007f0cf26e5e9a in start_thread ()
> > >   from /lib/x86_64-linux-gnu/libpthread.so.0
> > > #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #11 0x0000000000000000 in ?? ()
> > > (gdb) print task
> > > $1 = (mesos::internal::Task *) 0x3031406576616c73
> > > (gdb) info register
> > > rax            0x7f0cf3647cf0   139693599784176
> > > rbx            0x0      0
> > > rcx            0x7f0ce8000038   139693408649272
> > > rdx            0x3031406576616c73       3472627592201333875
> > > rsi            0x2      2
> > > rdi            0x7f0cf0613ac0   139693549238976
> > > rbp            0x7f0ce80034c8   0x7f0ce80034c8
> > > rsp            0x7f0cf0613c00   0x7f0cf0613c00
> > > r8             0x7f0ce80009b0   139693408651696
> > > r9             0x1      1
> > > r10            0x6      6
> > > r11            0x1      1
> > > r12            0x7f0ce8001ca0   139693408656544
> > > r13            0x7f0ce80056c0   139693408671424
> > > r14            0x7f0ce8006cc0   139693408677056
> > > r15            0x1305820        19945504
> > > rip            0x7f0cf30fecd5   0x7f0cf30fecd5
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+533>
> > > eflags         0x10206  [ PF IF RF ]
> > > cs             0xe033   57395
> > > ss             0xe02b   57387
> > > ds             0x0      0
> > > es             0x0      0
> > > fs             0x0      0
> > > gs             0x0      0
> > >
> > > disassemble:
> > >
> > >  0x00007f0cf30fecb9 <+505>:    mov    %rax,0x20(%rsp)
> > >   0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
> > >   0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
> > >   0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+622>
> > >   0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
> > >   0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1981>
> > >   0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
> > > => 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
> > >   0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
> > >   0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
> > > <_ZN5mesos8internal5slave19isTerminalTaskStateENS_9TaskStateE@plt>
> > >   0x00007f0cf30fece2 <+546>:   test   %al,%al
> > >   0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
> > >   0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1376>
> > >   0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
> > >   0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
> > > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > > const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> > > or q <re
> > >
> > > --
> > >         Scott
> > >
> >
>
