[
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369238#comment-16369238
]
Benno Evers commented on MESOS-8594:
------------------------------------
The analysis by [~abudnik] seems to be correct. The actual site of the crash
looks completely harmless, with no dangling pointers or anything, and the call
stack is very deep, going repeatedly through `process::internal::send()` and
`process::internal::_send()`.
The root cause seems to be this ancient TODO in `Future<T>::onAny()`:
{noformat}
synchronized (data->lock) {
  if (data->state == PENDING) {
    data->onAnyCallbacks.emplace_back(std::move(callback));
  } else {
    run = true;
  }
}
// TODO(*): Invoke callback in another execution context.
if (run) {
  std::move(callback)(*this); // NOLINT(misc-use-after-move)
}
{noformat}
so whenever we arrive in `send()` and the future returned by the socket is
already finished, the callback runs inline and we add another 5-10 frames to
the stack.
Most likely, due to the large number of big packets being sent over a loopback
interface, the socket always has enough data ready for this build-up to
continue until the program runs out of stack space.
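To illustrate the mechanism, here is a minimal sketch that is not libprocess code. `ReadyFuture`, `sendInline()`, `sendDeferred()` and the depth counter are hypothetical names invented for this example. It compares running a continuation inline on an already-completed future with handing it to a queue that is drained after the current call has returned, which is one possible reading of "another execution context".

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a future that is already completed.
struct ReadyFuture {
  // Mirrors the `run = true` branch above: the callback runs in the caller's frame.
  void onAnyInline(const std::function<void()>& callback) const { callback(); }
};

// Tracks how many simulated send() calls are active at once.
struct DepthCounter {
  int current = 0;
  int max = 0;
};

// Inline variant: the continuation performs the next send while this one is
// still on the stack, so the nesting grows with the number of packets.
void sendInline(int remaining, DepthCounter& depth) {
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  if (remaining > 1) {
    ReadyFuture{}.onAnyInline([&, remaining] { sendInline(remaining - 1, depth); });
  }
  --depth.current;
}

int maxDepthInline(int packets) {
  DepthCounter depth;
  sendInline(packets, depth);
  assert(depth.current == 0);  // every call has returned
  return depth.max;
}

// Deferred variant: the continuation is queued and runs only after the
// current send has returned, so the nesting stays constant.
void sendDeferred(int remaining, DepthCounter& depth,
                  std::deque<std::function<void()>>& queue) {
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  if (remaining > 1) {
    queue.push_back([&depth, &queue, remaining] {
      sendDeferred(remaining - 1, depth, queue);
    });
  }
  --depth.current;
}

int maxDepthDeferred(int packets) {
  DepthCounter depth;
  std::deque<std::function<void()>> queue;
  sendDeferred(packets, depth, queue);
  while (!queue.empty()) {  // drain loop standing in for an event loop
    std::function<void()> next = std::move(queue.front());
    queue.pop_front();
    next();
  }
  assert(depth.current == 0);
  return depth.max;
}
```

With 1000 simulated packets, `maxDepthInline(1000)` reports a nesting depth of 1000, while `maxDepthDeferred(1000)` stays at 1. In the real code each level would cost the 5-10 frames mentioned above rather than one.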
> Mesos master crash (under load)
> -------------------------------
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 1.6.0
> Reporter: A. Dukhovniy
> Priority: Major
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt,
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some information from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000089ecff8)
> frame #0: 0x000000010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}
> it’s the stack overflow bug in libprocess due to a way `internal::send()` and
> `internal::_send()` are implemented in `process.cpp`
> {quote}