[ https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369238#comment-16369238 ]
Benno Evers commented on MESOS-8594:
------------------------------------

The analysis by [~abudnik] seems to be correct: the actual site of the crash looks completely harmless, with no dangling pointers or anything, and the call stack is very deep, going repeatedly through `process::internal::send()` and `process::internal::_send()`.

The root cause seems to be this ancient TODO in `Future<T>::onAny()`:
{noformat}
    synchronized (data->lock) {
      if (data->state == PENDING) {
        data->onAnyCallbacks.emplace_back(std::move(callback));
      } else {
        run = true;
      }
    }

    // TODO(*): Invoke callback in another execution context.
    if (run) {
      std::move(callback)(*this); // NOLINT(misc-use-after-move)
    }
{noformat}
So whenever we arrive in `send()` and the future returned by the socket is already finished, the callback runs inline and we add another 5-10 function calls to the stack. Most likely, due to the large number of big packets being sent over a loopback interface, there is always enough data ready for the stack to keep building up until the program runs out of stack space.

> Mesos master crash (under load)
> -------------------------------
>
>                 Key: MESOS-8594
>                 URL: https://issues.apache.org/jira/browse/MESOS-8594
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: A. Dukhovniy
>            Priority: Major
>         Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt,
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached are some infos from the `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000089ecff8)
>     frame #0: 0x000000010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
>    32   template <typename T>
>    33   struct _Some
>    34   {
> -> 35     _Some(T _t) : t(std::move(_t)) {}
>    36
>    37     T t;
>    38   };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}
> it’s the stack overflow bug in libprocess due to a way `internal::send()` and
> `internal::_send()` are implemented in `process.cpp`
> {quote}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
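
The stack build-up described in the comment can be illustrated with a small standalone sketch. This is not libprocess code: `ReadyFuture`, `DepthTracker`, `sendInline()`, `sendDeferred()` and the two `maxDepth...()` helpers are hypothetical names invented for the illustration, and the "future" is assumed to be always already completed, as a socket send over loopback might be. The sketch only counts nesting depth. It compares running the callback inline (the `run = true` branch of `onAny()`) with handing it to a run queue, which is one possible reading of "another execution context" in the TODO.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a future that is already completed when the
// callback is registered (assumption: the send finished immediately).
struct ReadyFuture
{
  // Mirrors the inline branch: the callback runs on the caller's stack.
  void onAnyInline(std::function<void()> callback)
  {
    std::move(callback)();
  }

  // Alternative: enqueue the callback so that it runs later, after the
  // caller has returned and released its stack frames.
  void onAnyDeferred(
      std::deque<std::function<void()>>& queue,
      std::function<void()> callback)
  {
    queue.push_back(std::move(callback));
  }
};

// Tracks how many send calls are active at the same time.
struct DepthTracker
{
  std::size_t current = 0;
  std::size_t max = 0;
};

// Each completed send starts the next one from inside the callback,
// so every packet adds one more level of nesting.
void sendInline(std::size_t remaining, DepthTracker& depth)
{
  if (remaining == 0) return;
  ++depth.current;
  if (depth.current > depth.max) depth.max = depth.current;
  ReadyFuture future;
  future.onAnyInline([&depth, remaining]() {
    sendInline(remaining - 1, depth);
  });
  --depth.current;
}

// Same chain, but the continuation is queued instead of called.
void sendDeferred(
    std::size_t remaining,
    DepthTracker& depth,
    std::deque<std::function<void()>>& queue)
{
  if (remaining == 0) return;
  ++depth.current;
  if (depth.current > depth.max) depth.max = depth.current;
  ReadyFuture future;
  future.onAnyDeferred(queue, [&depth, &queue, remaining]() {
    sendDeferred(remaining - 1, depth, queue);
  });
  --depth.current;
}

// Maximum nesting depth when callbacks run inline.
std::size_t maxDepthInline(std::size_t packets)
{
  DepthTracker depth;
  sendInline(packets, depth);
  return depth.max;
}

// Maximum nesting depth when a simple loop drains the run queue.
std::size_t maxDepthDeferred(std::size_t packets)
{
  DepthTracker depth;
  std::deque<std::function<void()>> queue;
  sendDeferred(packets, depth, queue);
  while (!queue.empty()) {
    std::function<void()> callback = std::move(queue.front());
    queue.pop_front();
    callback();
  }
  return depth.max;
}
```

With inline invocation the nesting depth grows with the number of packets whose sends complete immediately, while the queued variant never has more than one send call active at a time. In the real code each level would cost the 5-10 frames mentioned above rather than the single counted call here.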