[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369238#comment-16369238
 ] 

Benno Evers commented on MESOS-8594:
------------------------------------

The analysis by [~abudnik] seems to be correct: the actual site of the crash 
looks completely harmless, with no dangling pointers or anything, and the call 
stack is very deep, going repeatedly through `process::internal::send()` and 
`process::internal::_send()`. (although

 

The root cause seems to be this ancient TODO in `Future<T>::onAny()`:
{noformat}
  synchronized (data->lock) {
    if (data->state == PENDING) {
      data->onAnyCallbacks.emplace_back(std::move(callback));
    } else {
      run = true;
    }
  }

  // TODO(*): Invoke callback in another execution context.
  if (run) {
    std::move(callback)(*this); // NOLINT(misc-use-after-move)
  }
{noformat}
 

So whenever we arrive in `send()` and the future returned by the socket is 
already finished, the callback runs synchronously on the current stack, and we 
add another 5-10 frames to the call stack.

 

Most likely, due to the large number of big packets being sent over a loopback 
interface, there is always enough data to allow a large enough build-up to 
cause the program to run out of stack space.
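
The build-up described above can be illustrated with a minimal sketch. It does 
not use libprocess: `ReadyFuture`, `DepthTracker`, `sendInline()` and 
`sendDeferred()` are hypothetical names made up for this example, and the 
queue-based variant is only one possible reading of "another execution 
context" from the TODO, not a proposed patch. The sketch counts how deeply the 
calls nest when the callback of an already finished future runs inline, 
compared to when it is queued and run from a loop.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a future that is already finished. It only
// mimics the "state != PENDING, so run the callback right away" branch.
struct ReadyFuture {
  template <typename F>
  void onAny(F&& callback) const { std::forward<F>(callback)(); }
};

// Tracks how many send calls are active on the stack at the same time.
struct DepthTracker {
  std::size_t current = 0;
  std::size_t max = 0;
  void enter() { if (++current > max) max = current; }
  void leave() { --current; }
};

// Inline variant: the continuation runs inside onAny(), so every packet
// adds frames on top of the previous send call.
void sendInline(std::size_t remaining, DepthTracker& depth) {
  depth.enter();
  if (remaining > 0) {
    ReadyFuture{}.onAny(
        [&depth, remaining]() { sendInline(remaining - 1, depth); });
  }
  depth.leave();
}

// Deferred variant (assumption: a simple task queue plays the role of the
// other execution context). The continuation is only enqueued here.
void sendDeferred(std::size_t remaining,
                  DepthTracker& depth,
                  std::deque<std::function<void()>>& queue) {
  depth.enter();
  if (remaining > 0) {
    queue.emplace_back([&depth, &queue, remaining]() {
      sendDeferred(remaining - 1, depth, queue);
    });
  }
  depth.leave();
}

// Returns the maximum nesting depth reached while sending `packets` packets.
std::size_t maxDepthInline(std::size_t packets) {
  DepthTracker depth;
  sendInline(packets, depth);
  return depth.max;
}

std::size_t maxDepthDeferred(std::size_t packets) {
  DepthTracker depth;
  std::deque<std::function<void()>> queue;
  sendDeferred(packets, depth, queue);
  while (!queue.empty()) {  // Drain the queue from a flat loop.
    std::function<void()> task = std::move(queue.front());
    queue.pop_front();
    task();
  }
  return depth.max;
}
```

In this sketch `maxDepthInline(100)` reaches a nesting depth of 101, growing 
linearly with the number of packets, while `maxDepthDeferred(100)` stays at 1 
because each continuation starts from the drain loop instead of from inside 
the previous call.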

 

> Mesos master crash (under load)
> -------------------------------
>
>                 Key: MESOS-8594
>                 URL: https://issues.apache.org/jira/browse/MESOS-8594
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: A. Dukhovniy
>            Priority: Major
>         Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached are some infos from the `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000089ecff8)
> frame #0: 0x000000010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik] 
> {quote}
> it’s the stack overflow bug in libprocess due to a way `internal::send()` and 
> `internal::_send()` are implemented in `process.cpp`
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)