[jira] [Updated] (MESOS-8256) Libprocess can silently deadlock due to worker thread exhaustion.

Benjamin Mahler (JIRA) Wed, 22 Nov 2017 19:36:18 -0800

     [ https://issues.apache.org/jira/browse/MESOS-8256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Benjamin Mahler updated MESOS-8256:
-----------------------------------
    Description: 
Currently, libprocess uses a fixed number of worker threads. Any code that 
blocks a worker thread and requires another worker thread to unblock it can 
therefore deadlock once enough such blocked calls occupy every worker thread. 
The deadlock is silent: nothing is logged, and no endpoint exposes it.
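
To illustrate the failure mode, here is a minimal sketch in standard C++. It 
does not use libprocess; {{FixedPool}} and {{poolExhausted}} are hypothetical 
names, and the 500 ms timeout exists only so that the demonstration terminates 
(real blocking code would wait indefinitely). Each parent task submits a child 
task to the same pool and then blocks on the child's result.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size worker pool; a stand-in, not libprocess code.
struct FixedPool {
  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(task));
    }
    ready_.notify_one();
  }

  // Starts `workers` threads; the pool never grows after this.
  void start(int workers) {
    assert(workers > 0);
    for (int i = 0; i < workers; ++i) threads_.emplace_back([this] { run(); });
  }

  // Lets the workers drain the queue, then joins them.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    ready_.notify_all();
    for (std::thread& thread : threads_) thread.join();
  }

  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Stopping, and nothing is left to run.
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<std::function<void()>> queue_;
  std::vector<std::thread> threads_;
  bool stopping_ = false;
};

// Queues `parents` tasks that each wait on a child task submitted to the same
// pool. Returns true if any parent was still waiting when its timeout expired.
bool poolExhausted(int workers, int parents) {
  FixedPool pool;
  std::atomic<int> stalled{0};
  std::vector<std::future<void>> finished;
  for (int i = 0; i < parents; ++i) {
    auto done = std::make_shared<std::promise<void>>();
    finished.push_back(done->get_future());
    pool.submit([&pool, &stalled, done] {
      auto child = std::make_shared<std::promise<void>>();
      std::future<void> result = child->get_future();
      pool.submit([child] { child->set_value(); });
      // Real code would block here indefinitely; the timeout only lets the
      // demonstration terminate.
      const auto limit = std::chrono::milliseconds(500);
      if (result.wait_for(limit) != std::future_status::ready) ++stalled;
      done->set_value();
    });
  }
  pool.start(workers);  // Parents were queued first, so they claim workers first.
  for (std::future<void>& f : finished) f.wait();
  pool.stop();
  return stalled.load() > 0;
}
```

Under these assumptions, {{poolExhausted(2, 2)}} reports a stall because both 
workers are occupied by waiting parents, while {{poolExhausted(3, 2)}} does not 
because one worker remains free to run the children.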

Our current approach to avoiding this issue has two parts: (1) forbid blocking 
a worker thread, and (2) set the worker thread pool minimum size to a known 
safe value. In practice, (1) is not upheld: a lot of code blocks using 
{{process::wait}} (the alternative is to spawn a managed process), and other 
code blocks in other ways (such as {{ZooKeeper}} or custom module code; this 
code could be fixed to be non-blocking). (2) is brittle: we cannot easily 
determine the minimum safe number as the code evolves and as users run module 
code.
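
As a rough illustration of what "non-blocking" means for approach (1), the 
sketch below registers a continuation instead of waiting for a result. It is 
plain standard C++ and not the libprocess API; {{RunQueue}}, {{fetchAsync}} and 
{{runNonBlocking}} are hypothetical names, and the queue is drained on the 
calling thread to stand in for a single worker.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-worker run queue; a stand-in for one worker thread.
class RunQueue {
public:
  void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Runs tasks in FIFO order on the calling thread until none remain.
  void drain() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

private:
  std::deque<std::function<void()>> tasks_;
};

// Hypothetical asynchronous operation: the result is delivered to `then`
// from a later task instead of being waited on.
void fetchAsync(RunQueue& queue, int input, std::function<void(int)> then) {
  queue.post([input, then] { then(input * 2); });
}

// Issues `count` requests without ever blocking the single worker.
std::vector<int> runNonBlocking(int count) {
  RunQueue queue;
  std::vector<int> results;
  for (int i = 0; i < count; ++i) {
    queue.post([&queue, &results, i] {
      // Register the continuation and return immediately.
      fetchAsync(queue, i, [&results](int value) { results.push_back(value); });
    });
  }
  queue.drain();
  return results;
}
```

Because no task waits on another, a single worker is enough here regardless of 
how many requests are in flight, which is why forbidding blocking removes the 
need to guess a safe pool size.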

Ideally:

(1) We should report the deadlock when it occurs, via a log message and 
possibly an endpoint, or even by crashing! Ideally, the user can see all of the 
stack traces to know why the deadlock occurred.

(2) Libprocess could keep a dynamically sized worker pool. At the very least, 
we could detect deadlock and spawn additional threads to get out of it, 
removing these threads later.
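
A minimal sketch of both ideas, in standard C++ rather than libprocess code: a 
watchdog logs when queued work has not been picked up for a whole interval and 
adds a temporary thread, which exits once the queue is empty. {{WatchedPool}}, 
{{rescueFromStall}} and the 50 ms interval are assumptions made for 
illustration. The check is only a heuristic, since it cannot tell blocked 
workers apart from workers running long tasks, and it does not capture stack 
traces.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool with a stall watchdog; not libprocess code.
struct WatchedPool {
  ~WatchedPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    wake_.notify_all();
    if (watchdog_.joinable()) watchdog_.join();  // No threads are added after this.
    for (std::thread& thread : threads_) thread.join();
  }
  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(task));
    }
    wake_.notify_all();  // Also wakes the watchdog, which just waits again.
  }
  void start(int workers) {
    assert(workers > 0);
    std::lock_guard<std::mutex> lock(mutex_);
    for (int i = 0; i < workers; ++i) threads_.emplace_back([this] { run(false); });
    watchdog_ = std::thread([this] { watch(); });
  }
  int rescueCount() { std::lock_guard<std::mutex> lock(mutex_); return rescues_; }

  // Permanent workers wait for tasks; rescue threads exit on an empty queue.
  void run(bool rescue) {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!rescue && !stopping_ && queue_.empty()) wake_.wait(lock);
        if (queue_.empty()) return;
        task = std::move(queue_.front());
        queue_.pop_front();
        ++dequeued_;
      }
      task();
    }
  }

  void watch() {
    std::unique_lock<std::mutex> lock(mutex_);
    unsigned long last = dequeued_;
    const auto interval = std::chrono::milliseconds(50);
    while (!wake_.wait_for(lock, interval, [this] { return stopping_; })) {
      // Heuristic: work is queued, yet nothing was dequeued for an interval.
      if (!queue_.empty() && dequeued_ == last) {
        std::cerr << "worker pool stalled; adding a thread" << std::endl;
        ++rescues_;
        threads_.emplace_back([this] { run(true); });
      }
      last = dequeued_;
    }
  }

  std::mutex mutex_;
  std::condition_variable wake_;
  std::deque<std::function<void()>> queue_;
  std::vector<std::thread> threads_;
  std::thread watchdog_;
  unsigned long dequeued_ = 0;
  int rescues_ = 0;
  bool stopping_ = false;
};

// Two workers, two parent tasks that each block on a child task.
int rescueFromStall() {
  WatchedPool pool;
  std::vector<std::future<void>> parents;
  for (int i = 0; i < 2; ++i) {
    auto done = std::make_shared<std::promise<void>>();
    parents.push_back(done->get_future());
    pool.submit([&pool, done] {
      auto child = std::make_shared<std::promise<void>>();
      std::future<void> result = child->get_future();
      pool.submit([child] { child->set_value(); });
      result.wait();  // Blocks until some other thread runs the child.
      done->set_value();
    });
  }
  pool.start(2);  // Parents were queued first, so they occupy both workers.
  for (std::future<void>& parent : parents) parent.wait();
  return pool.rescueCount();
}
```

In this sketch the parents block without a timeout, so {{rescueFromStall}} can 
only return because the watchdog added a thread to run the children. A real 
implementation would need a more reliable notion of "blocked" than a lack of 
dequeues.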

  was:
Currently, libprocess uses a fixed number of worker threads. This means that 
any code that blocks a worker thread and requires another worker thread to 
unblock it can lead to deadlock if there are sufficiently many of these to 
block all of the worker threads. The deadlock will occur without any logging of 
it and we don't expose an endpoint for it either.

Our current approach to avoid this issue is to (1) forbid blocking a worker 
thread, however you must write blocking code due to {{process::wait}} and other 
code still performs other blocking (such as {{ZooKeeper}} or custom module 
code) and (2) set the worker thread pool minimum size to a known safe value. 
Now that there is module code, (2) is brittle and we cannot determine the 
minimum safe number anymore.

Ideally:

(1) We can indicate that the deadlock occurs via a log message and also 
possibly an endpoint, or even crashing! Ideally, the user can see all of the 
stack traces to know why the deadlock occurred.

(2) Libprocess could keep a dynamically sized worker pool. At the very least, 
we could detect deadlock and spawn additional threads to get out of it, 
removing these threads later.


> Libprocess can silently deadlock due to worker thread exhaustion.
> -----------------------------------------------------------------
>
>                 Key: MESOS-8256
>                 URL: https://issues.apache.org/jira/browse/MESOS-8256
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Benjamin Mahler
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
