Benjamin Mahler created MESOS-8256:
--------------------------------------
Summary: Libprocess can silently deadlock due to worker thread
exhaustion.
Key: MESOS-8256
URL: https://issues.apache.org/jira/browse/MESOS-8256
Project: Mesos
Issue Type: Bug
Components: libprocess
Reporter: Benjamin Mahler
Priority: Critical
Currently, libprocess uses a fixed number of worker threads. This means that
any code that blocks a worker thread and requires another worker thread to
unblock it can lead to deadlock if enough such callers are blocked at once to
occupy all of the worker threads. The deadlock occurs silently: nothing is
logged, and we don't expose an endpoint that reveals it either.
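
The failure mode above can be reproduced outside of libprocess with a minimal
sketch. This is not libprocess code: {{FixedPool}} and {{blockersComplete}} are
hypothetical names, and a timeout stands in for what would be an indefinite
wait in a real deadlock, so that the example terminates.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size FIFO thread pool (illustration only).
class FixedPool {
public:
  explicit FixedPool(std::size_t workers) {
    assert(workers > 0);
    for (std::size_t i = 0; i < workers; ++i) {
      threads_.emplace_back([this] { run(); });
    }
  }

  // Drains any queued tasks, then joins all workers.
  ~FixedPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : threads_) t.join();
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Shutting down and queue is drained.
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> threads_;
  bool done_ = false;
};

// Queues `blockers` tasks that each block a worker until a later task
// signals them. Returns true only if every blocker was signalled within
// `timeout`; false means the signalling task never got a worker in time.
bool blockersComplete(std::size_t workers, std::size_t blockers,
                      std::chrono::milliseconds timeout) {
  auto unblocked = std::make_shared<std::promise<void>>();
  std::shared_future<void> signal = unblocked->get_future().share();
  std::vector<std::future<bool>> results;

  FixedPool pool(workers);
  for (std::size_t i = 0; i < blockers; ++i) {
    auto result = std::make_shared<std::promise<bool>>();
    results.push_back(result->get_future());
    pool.submit([signal, result, timeout] {
      // Blocks this worker; without the timeout this would wait forever.
      result->set_value(signal.wait_for(timeout) == std::future_status::ready);
    });
  }
  // The unblocking task sits behind the blockers and needs a free worker.
  pool.submit([unblocked] { unblocked->set_value(); });

  bool all = true;
  for (std::future<bool>& r : results) all = r.get() && all;
  return all;
}
```

With two workers and two blockers, the unblocking task never gets a thread
until the blockers give up, so {{blockersComplete(2, 2, ...)}} returns false.
With a third worker it returns true. Nothing in the pool itself reports the
difference, which is the silent part of the problem.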
Our current approach to avoiding this issue is to (1) forbid blocking a worker
thread, although some blocking code still remains (such as {{ZooKeeper}} or
custom module code), and (2) set the worker thread pool minimum size to a
known safe value. Now that there is module code, (2) is brittle and we can no
longer determine the minimum safe number.
Ideally:
(1) We can indicate that the deadlock has occurred via a log message, and
possibly also an endpoint, or even by crashing! Ideally, the user can see all
of the stack traces to know why the deadlock occurred.
(2) Libprocess could keep a dynamically sized worker pool. At the very least,
we could detect deadlock and spawn additional threads to get out of it,
removing these threads later.
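
As a rough illustration of the detect-and-grow idea in (2), here is a sketch
under stated assumptions. {{GrowingPool}}, {{rescueIfStalled}} and
{{runWithWatchdog}} are hypothetical names, not libprocess APIs. The stall
heuristic (every worker busy, tasks queued, and no task completed since the
previous check) is an assumption and would flag slow but healthy tasks too.
Removing the extra threads later and dumping stack traces are omitted.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool that adds a worker when it appears stuck.
class GrowingPool {
public:
  explicit GrowingPool(std::size_t workers) {
    assert(workers > 0);
    std::lock_guard<std::mutex> lock(mutex_);
    for (std::size_t i = 0; i < workers; ++i) spawn();
  }

  // Must not run concurrently with rescueIfStalled().
  ~GrowingPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : threads_) t.join();
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // Meant to be called periodically, e.g. from a watchdog thread.
  // Returns true if a stall was detected and one worker was added.
  bool rescueIfStalled() {
    std::lock_guard<std::mutex> lock(mutex_);
    bool stalled = busy_ == threads_.size() && !tasks_.empty() &&
                   completed_ == lastCompleted_;
    lastCompleted_ = completed_;
    if (stalled) {
      std::cerr << "All " << threads_.size() << " workers busy with "
                << tasks_.size() << " task(s) queued; adding a worker"
                << std::endl;
      spawn();
    }
    return stalled;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return threads_.size();
  }

private:
  // Caller must hold mutex_.
  void spawn() { threads_.emplace_back([this] { run(); }); }

  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Shutting down and queue is drained.
        task = std::move(tasks_.front());
        tasks_.pop();
        ++busy_;
      }
      task();
      std::lock_guard<std::mutex> lock(mutex_);
      --busy_;
      ++completed_;
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> threads_;
  std::size_t busy_ = 0, completed_ = 0, lastCompleted_ = 0;
  bool done_ = false;
};

// Queues `blockers` tasks that wait for a later task, polls the stall check
// until they all finish, and returns the final number of worker threads.
std::size_t runWithWatchdog(std::size_t workers, std::size_t blockers) {
  auto unblocked = std::make_shared<std::promise<void>>();
  std::shared_future<void> signal = unblocked->get_future().share();
  std::vector<std::future<void>> finished;
  GrowingPool pool(workers);
  for (std::size_t i = 0; i < blockers; ++i) {
    auto task = std::make_shared<std::packaged_task<void()>>(
        [signal] { signal.wait(); });  // Blocks a worker indefinitely.
    finished.push_back(task->get_future());
    pool.submit([task] { (*task)(); });
  }
  pool.submit([unblocked] { unblocked->set_value(); });
  for (std::future<void>& f : finished) {  // Stand-in for a watchdog thread.
    while (f.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready) {
      pool.rescueIfStalled();
    }
  }
  return pool.size();
}
```

In this sketch, a pool started with two workers and given two blockers grows
to three threads and then completes, while a pool that already has a spare
worker is left at its original size. A real implementation would need a
better stall signal than "no completions since the last check".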