[ https://issues.apache.org/jira/browse/MESOS-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Hindman resolved MESOS-873.
------------------------------------
Resolution: Fixed
Fix Version/s: 0.19.0
Assignee: Benjamin Hindman
> Crash in os::killtree on Mavericks
> -----------------------------------
>
> Key: MESOS-873
> URL: https://issues.apache.org/jira/browse/MESOS-873
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Environment: Mac OS X Mavericks
> Reporter: Niklas Quarfot Nielsen
> Assignee: Benjamin Hindman
> Fix For: 0.19.0
>
>
> This is a crash we experienced on a Mavericks installation. We haven't been
> able to reproduce it on other machines since, but managed to capture core
> files from the crashes.
> Here is the stack trace from the crashing thread:
> thread #2: tid = 0x0001, 0x0000000106816de5 mesos-executor`os::process(int) + 4133, stop reason = signal SIGSTOP
> frame #0: 0x0000000106816de5 mesos-executor`os::process(int) + 4133
> frame #1: 0x000000010681734c mesos-executor`os::processes() + 316
> frame #2: 0x0000000106817752 mesos-executor`os::killtree(int, int, bool, bool) + 66
> frame #3: 0x0000000106819748 mesos-executor`mesos::internal::CommandExecutorProcess::shutdown(mesos::ExecutorDriver*) + 200
> frame #4: 0x000000010798be70
> frame #5: 0x000000010798be60
> frame #6: 0x0000000106b21c20 libmesos-0.16.0.dylib`process::Event::~Event() + 32
> frame #7: 0x90c307894810c083
> The reported stop reason is misleading (all threads in the core file are
> reported as stopped).
> Here is a snippet of the disassembly around the failing frame:
> 0x106817306: je 0x106817460 ; os::processes() + 592
> 0x10681730c: movq 16(%rsp), %rax
> 0x106817311: movq 296(%rsp), %rbx
> 0x106817319: leaq 16(%rax), %r14
> 0x10681731d: leaq 128(%rsp), %rax
> 0x106817325: addq $8, %r14
> 0x106817329: movq %rax, 24(%rsp)
> 0x10681732e: leaq 384(%rsp), %rbp
> 0x106817336: cmpq %rbx, %r14
> 0x106817339: je 0x106817530 ; os::processes() + 800
> 0x10681733f: movl 32(%rbx), %esi
> 0x106817342: movq 24(%rsp), %rdi
> 0x106817347: callq 0x10681d5a0 ; symbol stub for: os::process(int)
> -> 0x10681734c: movl 128(%rsp), %esi
> 0x106817353: testl %esi, %esi
> 0x106817355: jne 0x1068173e0 ; os::processes() + 464
> 0x10681735b: movq 136(%rsp), %rsi
> 0x106817363: movq %rbp, %rdi
> 0x106817366: callq 0x10681d58e ; symbol stub for: os::Process::Process(os::Process const&)
> 0x10681736b: movl $112, %edi
> 0x106817370: callq 0x10681d9e4 ; symbol stub for: operator new(unsigned long)
> While investigating the crash live in lldb, we came to suspect that using
> sysctl to get the argument count was the cause of the crash, but we have no
> way to validate this yet.
> We can dig further into the core dump if you have any suspected causes of
> the failure or suggestions for where to look.
> Also, since we haven't been able to reproduce the crash, we can probably
> mark this as won't fix if we don't hear of anyone else with the same problem.
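The description above suspects the sysctl call that fetches a process's
argument count. As a minimal sketch of how such data can be read defensively,
the function below parses a buffer laid out the way OS X returns it for
sysctl with KERN_PROCARGS2 (an int argc, the executable path, NUL padding,
then the NUL-terminated arguments) and bounds-checks every step.
parseProcArgs is a hypothetical helper written for illustration; it is not
the Mesos code and not necessarily the fix that went into 0.19.0. It assumes
the caller has already filled the buffer via sysctl and resized it to the
length the kernel reported.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos). Parses a buffer laid out like the
// result of sysctl({CTL_KERN, KERN_PROCARGS2, pid}) on OS X:
//   [int argc][executable path]['\0' padding][argv[0]]'\0'[argv[1]]'\0'...
// Returns false instead of reading out of bounds when the buffer is
// truncated or argc does not match the data that is actually present.
bool parseProcArgs(const std::vector<char>& buffer,
                   std::vector<std::string>* args)
{
  args->clear();

  int argc = 0;
  if (buffer.size() < sizeof(argc)) {
    return false; // Too short to even hold argc.
  }
  std::memcpy(&argc, buffer.data(), sizeof(argc));
  if (argc < 0) {
    return false;
  }

  std::size_t pos = sizeof(argc);
  const std::size_t end = buffer.size();

  // Skip the executable path, then the NUL padding that follows it.
  while (pos < end && buffer[pos] != '\0') { ++pos; }
  while (pos < end && buffer[pos] == '\0') { ++pos; }

  for (int i = 0; i < argc; ++i) {
    const std::size_t start = pos;
    while (pos < end && buffer[pos] != '\0') { ++pos; }
    if (pos == end) {
      args->clear();
      return false; // Argument is not NUL-terminated: truncated data.
    }
    args->emplace_back(buffer.data() + start, pos - start);
    ++pos; // Step over the terminating NUL.
  }
  return true;
}
```

The point of the sketch is that argc is treated as untrusted input: a process
can exit or exec between the moment its pid is listed and the moment its
arguments are read, so a caller such as os::process(pid) would skip the
process when parsing fails rather than walk past the end of the buffer.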