Hi Ben/Scott,

Can you provide the slave log of the repro?

Thanks,
@vinodkone


On Mon, May 7, 2012 at 10:00 AM, Benjamin Hindman <[email protected]> wrote:

> Hi Scott,
>
> Thanks for the report. I've been able to reproduce this and it is indeed a
> regression. I've filed https://issues.apache.org/jira/browse/MESOS-190, and
> hopefully we'll get a fix out the door ASAP.
>
> Ben.
>
>
> On Fri, May 4, 2012 at 5:11 PM, Scott Smith <[email protected]> wrote:
>
> > When I restart, kill early, or otherwise interrupt my framework from the
> > client, the slave often segfaults.  I'm not sure if there is a bug in
> > my executor, but it seems Mesos should be more resilient than this.
> >
> > Mesos subversion -r 1331158
> >
> > I know optimized builds can be tricky to debug, but in this case it
> > does look like it was trying to dereference the invalid Task* address
> > (note that task matches %rdx, and the crashed assembly code is trying
> > to dereference %rdx).
> >
> > Any suggestions?
> >
> > (gdb) bt
> > #0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
> >    frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> > #1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
> >    this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> > #2  operator()<process::ProcessBase*> (this=<optimized out>)
> >    at /usr/include/c++/4.6/tr1/functional:1207
> > #3  std::tr1::_Function_handler<void (process::ProcessBase*),
> > std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> > std::tr1::shared_ptr<std::tr1::function<void
> > (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> > std::tr1::shared_ptr<std::tr1::function<void
> > (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> > const&, process::ProcessBase*) (__functor=...,
> >    __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> > #4  0x00007f0cf32014a3 in std::tr1::function<void
> > (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
> >   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > #5  0x00007f0cf31f617f in
> > process::ProcessBase::visit(process::DispatchEvent const&) () from
> > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > #6  0x00007f0cf31f885c in
> > process::DispatchEvent::visit(process::EventVisitor*) const () from
> > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > #7  0x00007f0cf31f38cf in
> > process::ProcessManager::resume(process::ProcessBase*) () from
> > /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > #8  0x00007f0cf31ec783 in process::schedule(void*) ()
> >   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> > #9  0x00007f0cf26e5e9a in start_thread ()
> >   from /lib/x86_64-linux-gnu/libpthread.so.0
> > #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > #11 0x0000000000000000 in ?? ()
> > (gdb) print task
> > $1 = (mesos::internal::Task *) 0x3031406576616c73
> > (gdb) info register
> > rax            0x7f0cf3647cf0   139693599784176
> > rbx            0x0      0
> > rcx            0x7f0ce8000038   139693408649272
> > rdx            0x3031406576616c73       3472627592201333875
> > rsi            0x2      2
> > rdi            0x7f0cf0613ac0   139693549238976
> > rbp            0x7f0ce80034c8   0x7f0ce80034c8
> > rsp            0x7f0cf0613c00   0x7f0cf0613c00
> > r8             0x7f0ce80009b0   139693408651696
> > r9             0x1      1
> > r10            0x6      6
> > r11            0x1      1
> > r12            0x7f0ce8001ca0   139693408656544
> > r13            0x7f0ce80056c0   139693408671424
> > r14            0x7f0ce8006cc0   139693408677056
> > r15            0x1305820        19945504
> > rip            0x7f0cf30fecd5   0x7f0cf30fecd5
> > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > const&, mesos::ExecutorID const&, int)+533>
> > eflags         0x10206  [ PF IF RF ]
> > cs             0xe033   57395
> > ss             0xe02b   57387
> > ds             0x0      0
> > es             0x0      0
> > fs             0x0      0
> > gs             0x0      0
> >
> > disassemble:
> >
> >   0x00007f0cf30fecb9 <+505>:   mov    %rax,0x20(%rsp)
> >   0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
> >   0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
> >   0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
> > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > const&, mesos::ExecutorID const&, int)+622>
> >   0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
> >   0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
> > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > const&, mesos::ExecutorID const&, int)+1981>
> >   0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
> > => 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
> >   0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
> >   0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
> > <_ZN5mesos8internal5slave19isTerminalTaskStateENS_9TaskStateE@plt>
> >   0x00007f0cf30fece2 <+546>:   test   %al,%al
> >   0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
> >   0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
> > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > const&, mesos::ExecutorID const&, int)+1376>
> >   0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
> >   0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
> > <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> > const&, mesos::ExecutorID const&, int)+1
> > ---Type <return> to continue, or q <re
> >
> > --
> >         Scott
> >
>
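
One observation on the gdb output quoted above, offered as a hint rather
than a diagnosis: the bad Task* value 0x3031406576616c73 (also in %rdx)
consists entirely of printable ASCII bytes. A minimal sketch of how to check
this, assuming a little-endian x86-64 host as the register dump suggests;
the helper name pointer_as_ascii is made up for illustration:

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Render a pointer-sized integer as the bytes it occupies in memory.

    Assumes little-endian byte order (x86-64). Non-printable bytes are
    shown as '.' so that garbage pointers are still readable.
    """
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Value of `task` (and %rdx) from the gdb session in the report.
print(pointer_as_ascii(0x3031406576616C73))  # prints "slave@10"
```

A pointer slot holding the text "slave@10" suggests that string data was
written over the memory the Task* was read from, for example because the
containing object had already been freed and its memory reused. That reading
is a guess from the register contents alone; the slave log should help
confirm or rule it out.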
