[ 
https://issues.apache.org/jira/browse/MESOS-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851315#comment-15851315
 ] 

Alexander Rojas commented on MESOS-7036:
----------------------------------------

I've given this a lot of thought, and two things are clear to me:

1. This is not a bug in libprocess; it is the equivalent of calling 
{{std::thread::join()}} on a thread from within that same thread. That is a 
bug on the part of the user of the library, not a library bug. Since we can 
detect the deadlock, though, I would suggest aborting in those situations, 
much as {{std::thread}} refuses to join with itself (it actually throws an 
exception of type {{std::system_error}} with code 
{{resource_deadlock_would_occur}}).

2. Based on the example given by [~alexr], there are two kinds of fixes for 
this problem. The first is to make wrapper classes uncopyable. As a rule of 
thumb, if a class calls {{delete}} (or {{terminate}}) in its destructor, it 
must not be copyable, since the first copy to be destroyed invalidates all the 
others. The second is to use a shared pointer as an RAII manager, e.g.:

{code}
class RateLimiter
{
public:
  RateLimiter(int permits, const Duration& duration);
  explicit RateLimiter(double permitsPerSecond);
  virtual ~RateLimiter();

  // Returns a future that becomes ready when the permit is acquired.
  // Discarding this future cancels this acquisition.
  virtual Future<Nothing> acquire() const;

private:
  // Not copyable, not assignable.
  RateLimiter(const RateLimiter&);
  RateLimiter& operator=(const RateLimiter&);

  std::shared_ptr<RateLimiterProcess> process;
};

RateLimiter::RateLimiter(int permits, const Duration& duration)
{
  // Custom deleter for `process` which terminates the process, waits for
  // it to exit, and only then deletes it.
  process.reset(new RateLimiterProcess(...), [](RateLimiterProcess *process) {
    process::terminate(process);
    process::wait(process);
    delete process;
  });
}
{code}
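
To show why the shared pointer helps, here is a minimal standalone sketch 
that does not use libprocess at all. {{FakeProcess}}, {{Wrapper}} and 
{{cleanupsAfterCopies}} are hypothetical names made up for illustration, and 
the counter stands in for the terminate/wait/delete sequence. The point is 
only that the custom deleter runs once, when the last copy goes away, no 
matter how many copies of the wrapper were made:

```cpp
#include <memory>

// Hypothetical stand-in for a libprocess process (not a real Mesos type).
struct FakeProcess {};

// Hypothetical wrapper that owns its process through a shared pointer.
class Wrapper
{
public:
  explicit Wrapper(int* cleanups)
    : process(new FakeProcess(), [cleanups](FakeProcess* p) {
        // In real code, terminate() and wait() would be called here.
        ++(*cleanups);
        delete p;
      }) {}

private:
  // Copies of `Wrapper` share ownership of the same process.
  std::shared_ptr<FakeProcess> process;
};

// Returns how many times the cleanup ran after three wrappers sharing
// one process have been destroyed.
int cleanupsAfterCopies()
{
  int cleanups = 0;
  {
    Wrapper original(&cleanups);
    Wrapper copy = original;
    Wrapper another = copy;
  }  // All three wrappers are destroyed here.
  return cleanups;
}
```

With a raw pointer and a {{delete}} in the destructor, the same three 
wrappers would have tried to clean up the same process three times.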

Note that neither of these fixes completely prevents calling {{wait()}} and 
{{terminate()}} on yourself, but they will reduce the chances that we do so in 
our code base. So if the actual cause is more complicated, these fixes may not 
be enough.
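
For comparison, the {{std::thread}} behaviour mentioned in point 1 can be 
reproduced with a small standalone sketch. The helper name {{selfJoinError}} 
is made up for illustration, and the code assumes a standard library that 
detects the self-join (as the standard requires) instead of blocking:

```cpp
#include <future>
#include <system_error>
#include <thread>

// Hypothetical helper: starts a thread that tries to join itself and
// returns the error code carried by the resulting exception.
std::error_code selfJoinError()
{
  std::thread worker;
  std::promise<void> assigned;
  std::promise<std::error_code> outcome;

  std::future<void> ready = assigned.get_future();
  std::future<std::error_code> result = outcome.get_future();

  worker = std::thread([&]() {
    // Wait until `worker` refers to this very thread.
    ready.wait();
    try {
      worker.join();  // Joining with yourself.
      outcome.set_value(std::error_code());
    } catch (const std::system_error& e) {
      outcome.set_value(e.code());
    }
  });

  assigned.set_value();

  // Block until the worker has reported, then join it from outside.
  std::error_code code = result.get();
  worker.join();
  return code;
}
```

The returned code can be compared against 
{{std::errc::resource_deadlock_would_occur}}. An abort in libprocess would be 
the same idea: fail loudly at the point of the mistake instead of hanging.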

> Rate limiter deadlocks during IO Switchboard-related tests
> ----------------------------------------------------------
>
>                 Key: MESOS-7036
>                 URL: https://issues.apache.org/jira/browse/MESOS-7036
>             Project: Mesos
>          Issue Type: Bug
>          Components: test, tests
>         Environment: ASF CI
>            Reporter: Greg Mann
>            Priority: Critical
>              Labels: flaky, mesosphere
>         Attachments: AgentAPITest.LaunchNestedContainerSessionWithTTY.txt
>
>
> This has been observed a number of times recently on the ASF CI. While I 
> didn't look through every single failed test log, I've noticed the failure 
> occur during the following tests:
> {code}
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> ContentType/AgentAPITest.LaunchNestedContainerSession/1
> ContentType/AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> IOSwitchboardTest.ContainerAttach
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> {code}
> In all cases, we see the following:
> {code}
> **** DEADLOCK DETECTED! ****
> You are waiting on process __limiter__(518)@172.17.0.3:35849 that it is 
> currently executing.
> {code}
> Find attached an entire example log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
