Author: Jan Uhlig <juhlig(at)hnc-agency(dot)org>, Maria Scott <maria-12648430(at)hnc-agency(dot)org>
    Status: Draft
    Type: Standards Track
    Created: 04-Mar-2021
    Erlang-Version: 25
    Post-History:
    Replaces:
****
EEP XXX: Automatic supervisor shutdown triggered by termination of significant childs
----


Abstract
========

This EEP introduces a way of automatically terminating supervisors based
on the termination of specifically marked significant childs.

This document is based on the discussion in [OTP-PR 4521][].


Motivation
==========

Childs under a supervisor often represent a work unit, that means, a group
of cooperating processes, as opposed to just a single process. Such work
unit supervisors (called group supervisors in the context of this document)
are themselves typically hosted by a simple_one_for_one supervisor, via
which they are started as needed.

At the time of this writing, however, there is no good, canonical way of
stopping such group supervisors once the work unit they represent has
finished it's work and the respective child processes have terminated,
meaning the group supervisors will hang around, idle forever unless
stopped manually one way or another.

This has been addressed in applications in a variety of ways, none of which
can be called truly good, straightforward, or canonical:

* Passing the Pid of the group supervisor to a child that is responsible for
  shutting down that supervisor. The shutdown is then achieved by sending
  an exit signal to the group supervisor. While the appropriate exit signals
  are documented, it is not for this purpose, and flinging around exit signals
  can be dangerous.
* Passing the Pid of the group supervisor and the Pid of the supervisor on top
  of it to a child that is responsible for shutting down that supervisor.
  The shutdown is then achieved by telling the supervisor on top to terminate
  the group supervisor via `supervisor:terminate_child/2`. As this is a
  blocking call, a process has to be spawned for it. Also, this will cause
  the top supervisor to be blocked until the group supervisor has been shut
  down, so it will not accept other requests until then.

Both of the above approaches suffer from the fact that the childs responsible
for the shutdown have to know things about their surroundings, namely...:
* ... that it is located under a (group) supervisor in the first place. In the
  second of the above approaches, even that this group supervisor is again
  located under yet another supervisor. And, conversely, it is not possible
  to run this process independently (e.g. for testing), that is, without
  the supervisor layers on top.
* ... that it may have siblings, and that it is safe to cause their shutdown
  by shutting down the supervisor. From a programmer's perspective, it is not
  obvious just by looking at the supervisor implementation that a certain
  child may cause a shutdown of the supervisor and all childs, so there is
  a potential for surprises.

This may be tackled by having a dedicated overseer child that watches the
other childs and acts according to their behavior. However, this requires
considerable boilerplate code for tasks that would be better suited in the
supervisor. Also, there is the problem that the overseer process must keep
the list of childs it watches up to date should any of them be restarted,
either by enabling the childs to register with it on start (for which they
in turn must know the overseer process' pid), or asking the supervisor
for it.

Another approach that is often used is to make the childs responsible for
the shutdown of the group supervisor permanent and the supervisor's restart
intensity to 0. This has the downside that the child will not be restarted
but cause the supervisor to shut down if it exits abnormally and _could_
be restarted. Another downside to this approach is that it produces error
messages (crash reports), even if the shutdown is intended.

Last but not least, some people have taken the approach to clone the OTP
supervisor and customize it to their needs, for reasons outlined here and
others.


Rationale
=========

This EEP provides a means to alleviate the problems outlined in the
motivation by introducing a way to mark specific childs as `significant`
via a new child spec flag, and a way to configure supervisors to shut down
automatically depending on the exit of significant childs via a new
supervisor flag.

In order to keep backwards compatibility, the new flags will only be usable
in the map forms of child specs and supervisor flags, and for the same reason
the default values for the new flags are chosen such that, in their absence,
the supervisor behaves the same as it does to date.

The new child spec flag is named `significant` with possible values `true` and
`false`, with `false` being the default.

The new supervisor flag is named `shutdown` with possible values `normal`,
`any_significant`and `all_significant`, with `normal` being the default.

(The proposed names and values are working titles.)

With the supervisor `shutdown` flag set to `normal`, the child spec flag
`significant` is ignored, even if present and set to `true`. This is intended
as a safety means to defend against unwanted breaking of old code.

Otherwise, the following rules apply when a significant child exits _on it's
own_:

* A `permanent` child will always be restarted and not cause the supervisor to
  shut down, ie the `significant` flag has no meaning for permanent childs.
 * A `transient` child will be restarted (not cause a supervisor shutdown)
   if it exits abnormally. If it exits normally...
   * if the supervisor `shutdown` flag is `any_significant`, the supervisor
     will shut down
   * if the supervisor `shutdown` flag is `all_significant`, the supervisor
     will shut down if the child was the last active significant child
* A `temporary` child will never be restarted. If it exits normally or
  abnormally, the same rules as for `transient` childs apply, in regard to
  the supervisor `shutdown` flag.

To be clear, the above rules only apply when _significant_ childs exit
_by themselves_, that is, not when other non-significant childs exit,
not as a consequence of being restarted by a sibling's death in the
`one_for_all` or `rest_for_one` strategies, and not when being terminated
manually via `supervisor:terminate_child/2`.

The approach proposed here could also be used to the effect of "shutdown when
empty" by marking _all_ childs as `significant` and setting the supervisor
`shutdown` flag to `all_significant`.


Backwards Compatibility
=======================

The changes proposed in this document introduce no incompatible changes, as
the new child spec and supervisor flags are optional and default to values
that result in the current behavior. Also, all the current workarounds
outlined in the Motivation will still work.


[OTP-PR 4521]: https://github.com/erlang/otp/pull/4521
    "supervisor: add restart type intrinsic #4521"


Copyright
=========

This document has been placed in the public domain.


[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "