    Author: Maria Scott <maria-12648430(at)hnc-agency(dot)org>
            Jan Uhlig <juhlig(at)hnc-agency(dot)org>
    Status: Draft
    Type: Standards Track
    Created: 16-Jun-2021
    Post-History:
    Replaces:
****
EEP XXX: Delayed restarts of supervisor children
----



Abstract
========

This EEP introduces a mechanism to delay the restart of crashed children of
supervisors.



Motivation
==========

At the time of this writing, crashed children of supervisors are
restarted immediately. If a child fails to restart, another restart
attempt is made, again immediately, until either the restart
succeeds or the supervisor's restart intensity limit is exhausted.

This approach is rather aggressive, and there are scenarios where
frantic restarting is futile and will likely achieve nothing but
a shutdown of the supervisor.
For example, the child in question may depend on an external service
such as a database, which in turn may be temporarily down or overloaded.

The request for a way to introduce restart delays has appeared on
the erlang-questions mailing list again and again over the years,
but was never implemented (though an attempt was made years ago).
We suspect that this ultimately led to a number of customized
clones of the OTP supervisor.

This document attempts to present a canonical, standardized way to
provide the means for delayed restarts of supervisor children, in order
to address the obvious need and discourage the creation of further
OTP supervisor clones.



Rationale
=========

This EEP introduces a new child spec key `restart_delay` which accepts
either the atom `undefined` or a non-negative integer as values, with
`undefined` being the default.

In order to reduce the number of invalid combinations of options, `undefined`
is the only allowed value when the restart type of a child is `temporary`:
such children are never restarted, so a restart delay makes no sense in this
combination.

In combination with the restart types `permanent` or `transient`, the value
`undefined` maps to `0`, i.e. immediate restart, which is equivalent to the
current behavior.
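
The rules above can be summarized in a short sketch. It is written in Python
purely as an executable model and is not proposed supervisor code. The function
name `effective_restart_delay` is hypothetical, and `None` stands in for the
atom `undefined`.

```python
RESTART_TYPES = ("permanent", "transient", "temporary")


def effective_restart_delay(restart, restart_delay=None):
    """Validate a restart_delay value and return the delay to apply.

    Returns None for temporary children, which are never restarted.
    """
    if restart not in RESTART_TYPES:
        raise ValueError(f"invalid restart type: {restart!r}")
    if restart_delay is None:
        # `undefined` is the default; it maps to 0 (immediate restart)
        # for permanent and transient children.
        return None if restart == "temporary" else 0
    if restart == "temporary":
        # Only `undefined` is allowed for temporary children.
        raise ValueError("restart_delay must be undefined for temporary children")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(restart_delay, bool) or not isinstance(restart_delay, int):
        raise ValueError("restart_delay must be undefined or an integer")
    if restart_delay < 0:
        raise ValueError("restart_delay must be non-negative")
    return restart_delay
```

In this model, `effective_restart_delay("permanent")` yields `0`, while
`effective_restart_delay("temporary", 0)` is rejected because a temporary child
accepts only `undefined`.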

The details of the delayed restart mechanism depend on the restart strategy of
the supervisor, as discussed below.

In the following paragraphs, only permanent and transient children and the exits
that cause a restart of such children are considered. Temporary children, as noted
above, are never restarted, and are therefore not discussed.

Also, a distinction is made between _running_ and _active_ children. The term
_active_ children refers to any children that have not been stopped manually, i.e.
children that either represent a running process or are currently being restarted.
_Running_ children are a subset of active children, namely the children that
represent a running process, not the ones currently being restarted.



one_for_one and simple_one_for_one
----------------------------------

When the supervisor notices the crash of a child (the offender)...

1. it marks the offending child as restarting
2. it starts a timer to send a message to itself after the specified delay
3. it resumes its receive loop
4. when the timed message arrives, it will trigger the actual restart of
   the offending child

If the child fails to restart, this procedure is repeated.
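
The timing that results from these steps can be sketched with a small model.
The sketch below is in Python, not Erlang, and makes simplifying assumptions
that are not part of this proposal: a restart attempt takes no time, and the
restart intensity limit is ignored. The function name `restart_attempt_times`
is hypothetical.

```python
def restart_attempt_times(crash_time, delay, attempt_outcomes):
    """Return the times at which restart attempts are made.

    attempt_outcomes holds one boolean per attempt, True meaning that
    the child started successfully.
    """
    times = []
    timer_started = crash_time  # steps 1-2: mark as restarting, start timer
    for started in attempt_outcomes:
        fire_time = timer_started + delay  # step 4: timed message arrives
        times.append(fire_time)
        if started:
            break
        timer_started = fire_time  # restart failed: repeat the procedure
    return times


# A child with a delay of 500 crashes at time 0 and needs three attempts.
attempts = restart_attempt_times(0, 500, [False, False, True])
```

Under these assumptions the attempts happen at times 500, 1000 and 1500, i.e.
every retry is preceded by the full delay again.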



one_for_all
-----------

When the supervisor notices the crash of a child (the offender)...

1. it terminates all _running_ siblings and marks them as well as the offending
   child as restarting
2. it starts a timer to send a message to itself after the _maximum_ of the delays
   of the offending child and the siblings terminated in (1)
3. it resumes its receive loop
4. when the timed message arrives, it will trigger the actual restarts of the
   offending child and the siblings terminated in (1), in proper order

If any child fails to start, this procedure is repeated, with the failed
child becoming the new offender.

By using the maximum of the delays in (2), the children are restarted at the same time,
as a unit, without intervening delays.

While the delay between a crash and the first restart attempt is constant, the delays
_between_ restart _retries_ are not necessarily constant. The children behind the failing one
have not been tried yet, and since the maximum delay was used in (2), their delays need not be
considered when retrying. Put simply, only the delays of the failing child and of the children
before it need to be considered when retrying a restart.
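
The delay calculation in (2) and its effect on retries can be illustrated with
a minimal sketch. It is a Python model under stated assumptions, not supervisor
code: `delays` maps hypothetical child ids to their effective restart delays,
and `running` is the set of children that are running when the crash or the
failed start is noticed. The function name is hypothetical as well.

```python
def one_for_all_timer_delay(delays, running, offender):
    """Maximum delay over the offender and the running siblings that
    are terminated in step (1)."""
    terminated = [child for child in running if child != offender]
    return max(delays[child] for child in [offender, *terminated])


# Children in start order: a, b, c.
delays = {"a": 100, "b": 200, "c": 500}

# All children are running and c crashes: a and b are terminated.
first_delay = one_for_all_timer_delay(delays, {"a", "b", "c"}, "c")

# When the timer fires, a starts but b fails to start. b becomes the
# new offender; c was not tried, so only a is running at this point.
retry_delay = one_for_all_timer_delay(delays, {"a"}, "b")
```

Here the first delay is 500 while the retry delay is only 200, which shows how
the delays between retries can differ from the initial delay.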



rest_for_one
------------

When the supervisor notices the crash of a child (the offender)...

1. it terminates all _running_ siblings behind the offending child and marks
   them as well as the offending child as restarting
2. it starts a timer to send a message to itself after the _maximum_ of the delays
   of the offending child and the siblings terminated in (1)
3. it resumes its receive loop
4. when the timed message arrives, it will trigger the actual restarts of the
   offending child and the siblings terminated in (1), in proper order

If any child fails to start, this procedure is repeated, with the failed
child becoming the new offender.

While the maximum of the delays is also used in (2) in this strategy, the
guarantees it provides, concerning restarting at the same time as a unit
without intervening delays, are weaker: when one of the children fails to
restart, the restarting will resume from there, but after the respective delay.

As in the one_for_all strategy, the delay between a crash and the first
restart attempt is constant, but the delays between restart retries are not
necessarily constant. In rest_for_one, we can even say that the delays between
retries are either the same or decreasing. The children before the offending
child are running and so need not be considered. The children behind the
offending child have not been tried, and since the maximum delay was used in (2),
they need not be considered either. Put simply, only the delay of the offending
child needs to be considered when retrying a restart.
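
The same kind of sketch can be given for rest_for_one. Again this is a Python
model under assumptions, not supervisor code: `order` lists hypothetical child
ids in start order, `delays` maps them to their effective restart delays, and
`running` is the set of children running when the crash or the failed start is
noticed.

```python
def rest_for_one_timer_delay(order, delays, running, offender):
    """Maximum delay over the offender and the running siblings behind
    it, which are the ones terminated in step (1)."""
    position = order.index(offender)
    terminated = [child for child in order[position + 1:] if child in running]
    return max(delays[child] for child in [offender] + terminated)


order = ["a", "b", "c"]
delays = {"a": 500, "b": 100, "c": 300}

# All children are running and b crashes: only c is terminated, so the
# delay of a plays no role.
first_delay = rest_for_one_timer_delay(order, delays, {"a", "b", "c"}, "b")

# When the timer fires, b fails to start. No child behind b is running,
# so only the delay of b itself counts.
retry_delay = rest_for_one_timer_delay(order, delays, {"a"}, "b")
```

In this example the first delay is 300 and the retry delay drops to 100, since
on the retry only the offending child contributes.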

A special case in rest_for_one is that one of the still running children may
crash while children further back are in delay. In this case, the calculation of
the new delay should take the remainder of the already running delay into consideration.



Considerations
==============

An open question related to the one_for_all and rest_for_one strategies is how adding
children dynamically via start_child/2 should be handled while some children are in delay.
While this may be an unusual practice with the given strategies, it is nevertheless allowed,
at any time.

It should be noted that this can already happen with the current implementation. While the
first restart attempt is done synchronously, restart _retries_ are interleaved with the
supervisor returning to its receive loop. Thus, if the first restart attempt fails, it may
be possible that a child is started dynamically between restart retries. However, the time
window where this may happen is very small and will not even open unless the first attempt
fails. With delayed restarts, however, this window will always exist and be open for a
possibly long time.

The possible ways to address this case are the following:

* The solution that we would prefer is to unconditionally allow and perform the dynamic
  starts of children, regardless of the status of the other children. This is the simplest
  approach with the least potential for surprises. Such children may however want to be prepared
  for the case that the other processes they may rely upon (as the given strategies are
  intended to be used for groups of interdependent processes) are not there at the time they
  are starting.
* Forbid the dynamic starts of children while other children are restarting, i.e. return
  an error tuple or throw an exception from start_child/2. This would make things rather difficult
  for code calling start_child/2, as it would have to match the replies and decide if,
  when, and how often to retry.
* Defer the dynamic addition and thereby the answer to the start_child/2 call until the supervisor
  has reached a stable state with all children running. This may take a considerable time,
  or even never complete (see the Caveats section below), and would be something unknown
  or hard to estimate for the caller. It may not even be necessary from the caller's point of view.

A related question is how to handle a manual (i.e. forced) restart of a child in delay via
restart_child/2. The same possible solutions (and our preference) apply here.



Caveats
=======

Using delayed restarts has the potential to effectively disable the restart intensity
limit. While there are already ways to do this deliberately, careless use of restart delays
carries a risk of disabling it unintentionally. In essence, whenever a child has a delay
close to or even greater than the restart period divided by the restart intensity, restarts
can never accumulate fast enough to exceed the limit, and the supervisor may end up in an
infinite restart cycle. It is the responsibility of the user to choose delays and restart
limits wisely, and the documentation should contain appropriate cautioning.
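
The effect can be illustrated with a small model. The sketch below is Python,
not supervisor code, and rests on assumptions that this document does not
specify: every restart attempt counts towards the intensity limit, the limit is
exceeded when more than `intensity` restarts fall within `period`, and delays
and the period are expressed in the same time unit. The function names are
hypothetical.

```python
def first_shutdown(restart_times, intensity, period):
    """Return the time at which the intensity limit is exceeded, or
    None if it is never exceeded for the given restart times."""
    window = []
    for now in restart_times:
        # Keep only the restarts that are still within the period.
        window = [t for t in window if now - t <= period] + [now]
        if len(window) > intensity:
            return now
    return None


def attempt_times(delay, attempts):
    """Times of restart attempts for a child that never starts
    successfully, with one attempt per elapsed delay."""
    return [delay * n for n in range(1, attempts + 1)]


# Intensity 3 within a period of 5 time units.
short_delay = first_shutdown(attempt_times(1, 10), intensity=3, period=5)
long_delay = first_shutdown(attempt_times(2, 1000), intensity=3, period=5)
```

With a delay of 1 the limit is exceeded at the fourth attempt, so `short_delay`
is 4. With a delay of 2 no more than three attempts ever fall within one
period, so `long_delay` is `None` and the modelled supervisor keeps retrying
indefinitely.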



Copyright
=========

This document has been placed in the public domain.



[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "
