Re: [elixir-core:12020] Preventing orphan processes on BEAM crash

José Valim Sun, 16 Feb 2025 01:00:59 -0800

Thank you for the proposal, Adam. In this case, the proposal has to be sent
upstream to Erlang, as Elixir simply delegates the Port functionality to
the Erlang VM.



*José Valim*
https://dashbit.co/


On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m.wi...@gmail.com> wrote:

> While writing a library to integrate with an indivisibly long-running,
> external program (rsync), I came across the problem described in
> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes
> and I think there may be some fundamental mistakes in the advice given
> there.
>
> Our analysis in the Port documentation says that a polite application will
> detect when its stdio communication pipes are closed and will then
> terminate itself.  The fact that this is the case seems to be accidental,
> and is based on an empirical observation that most applications do some
> sort of I/O, so when one of the standard file descriptors closes the
> application will encounter a read or write error and will stop.  However,
> there are plenty of applications which can and should continue beyond this
> condition, and there's even a utility `nohup(1)` for exactly the purpose of
> allowing applications to ignore problems with stdio, for example when
> they're launched and backgrounded from an interactive terminal that will be
> closed.
>
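To make the point above concrete, here is a minimal sketch of what "detecting that the pipe is closed" takes on the child's side. It assumes the descriptor is the read end of a pipe; `peer_closed` is a name made up for this illustration, not something from the Port docs. A program only learns that the pipe is gone if it makes a check like this or attempts I/O on the descriptor.

```c
#include <poll.h>
#include <unistd.h>

// Returns 1 if every writer on the other end of the pipe `fd` has closed it,
// 0 if a writer is still attached, and -1 if poll() itself failed.
int peer_closed(int fd) {
  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
  int ready = poll(&pfd, 1, 0);  // timeout 0: check and return immediately
  if (ready < 0) {
    return -1;
  }
  // POLLHUP is reported on a pipe's read end once the last writer is gone.
  return (pfd.revents & POLLHUP) != 0;
}
```

A long-running tool would have to call something like `peer_closed(STDIN_FILENO)` periodically and exit when it returns 1. A program that never touches stdin has no reason to do so.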
> An example of a utility which does no I/O and therefore ignores stdio file
> descriptor statuses by default is `sleep(1)`, and I don't think it would be
> correct to make it stop because stdio is closed.  Running under elixir
> provides a good demonstration of the problem we're looking at here:
>
>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>
> Start that command and then kill the BEAM, and look for the sleep
> process.  It should still be running, in process state "Ss".
>
> This also demonstrates a second problem with the Port documentation: the
> condition we're dealing with is not a zombie but an "orphan process", one
> which is still running but is no longer associated with a BEAM parent and
> can no longer be controlled or communicated with by Elixir.  Orphans are a
> bigger issue than "zombie processes", which have already terminated and
> will show up in `ps` output in state "Z", because an orphan can still cause
> side-effects and consume resources.
>
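The distinction above can be checked directly on Linux. This is a rough sketch, assuming `/proc` is mounted; `proc_state` and `zombie_demo` are names made up for this illustration. It shows that a zombie is only an unreaped entry in the process table: the child has already exited, yet the kernel still lists it, in state "Z", until the parent waits for it.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the one-letter state of `pid` from /proc/<pid>/stat (Linux only),
// for example 'R', 'S' or 'Z', or '?' if it cannot be read.
char proc_state(pid_t pid) {
  char path[64], buf[512];
  snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
  FILE* f = fopen(path, "r");
  if (f == NULL) return '?';
  size_t n = fread(buf, 1, sizeof buf - 1, f);
  fclose(f);
  buf[n] = '\0';
  // The state follows the command name, which is wrapped in parentheses and
  // may itself contain ')', so search from the right.
  char* close_paren = strrchr(buf, ')');
  if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0') {
    return '?';
  }
  return close_paren[2];
}

// Forks a child that exits at once, waits for that exit without reaping the
// child, and returns the state the kernel reports for it in the meantime.
char zombie_demo(void) {
  pid_t pid = fork();
  if (pid < 0) return '?';
  if (pid == 0) _exit(0);
  siginfo_t info;
  // WNOWAIT leaves the exited child in the process table.
  if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0) return '?';
  char state = proc_state(pid);
  waitpid(pid, NULL, 0);  // reap the child so no zombie is left behind
  return state;
}
```

An orphan, by contrast, would keep reporting a live state such as "S" or "R" under a new parent, which is why it can keep consuming resources.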
> Some helpful Internet posts led me to what I believe is the correct way to
> prevent an orphan child process, by calling it through an intermediate
> application similar to the one suggested by Port docs but using `prctl(2)`
> instead, which allows the intermediate to monitor the parent process (the
> BEAM) and kill its child if the parent is terminated.  The code below still
> has a small race condition on launch, but I'll share it anyway:
>
> ```c
> #define _XOPEN_SOURCE 700
> #include <signal.h>
> #include <stddef.h>
> #include <stdlib.h>
> #include <sys/prctl.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> pid_t child_pid;
>
> void handle_signal(int signum) {
>   if (signum == SIGHUP && child_pid > 0) {
>     kill(child_pid, SIGKILL);
>   }
> }
>
> int main(int argc, char* argv[]) {
>   if (argc < 2) {
>     return 2;  // usage: parent-monitor /path/to/command [args...]
>   }
>
>   // Send this process a HUP if the parent BEAM VM dies.
>   // FIXME: race condition until this line, if the parent is already dead.
>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>
>   // Listen for HUP and respond by killing the child process.
>   struct sigaction action;
>   action.sa_handler = handle_signal;
>   action.sa_flags = 0;
>   sigemptyset(&action.sa_mask);
>   sigaction(SIGHUP, &action, NULL);
>
>   child_pid = fork();
>   if (child_pid < 0) {
>     return 1;  // fork failed, nothing to monitor
>   } else if (child_pid == 0) {
>     // argv[argc] is NULL, so argv + 1 is already a valid argument vector.
>     execv(argv[1], argv + 1);
>     _exit(127);  // only reached if execv failed
>   } else {
>     // Returns when the child exits, or early with EINTR once the HUP
>     // handler has run and killed the child.
>     waitpid(child_pid, NULL, 0);
>   }
>
>   return 0;
> }
> ```
>
> To try it out, save as main.c and compile like so:
>
>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c
>
> This tool shifts its argv by one to construct the child process, so from
> the command line it would be called like `parent-monitor /usr/bin/sleep 60`
> and to exercise a BEAM crash you can call it like so (after adjusting the
> paths for your system):
>
>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>
> Now you can see the sleep process is killed as soon as the VM is stopped.
>
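On the FIXME in the wrapper: one common way to narrow that race is to record the parent's pid first, arm the death signal, and then check that the parent is still the same. The sketch below assumes Linux; `arm_parent_death_signal` is a name made up for this illustration. It narrows the window rather than closing it, because the parent could already be gone before the first `getppid()` call.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

// Arms PR_SET_PDEATHSIG with `signum`, then checks that the parent recorded
// before arming is still our parent.  Returns 0 if it is, or -1 if prctl()
// failed or we have already been re-parented, in which case the caller
// should exit instead of starting the child.
int arm_parent_death_signal(pid_t expected_parent, int signum) {
  if (prctl(PR_SET_PDEATHSIG, signum) != 0) {
    return -1;
  }
  // If the parent died before prctl() took effect, no signal will ever be
  // delivered, but getppid() now reports whoever adopted this process.
  if (getppid() != expected_parent) {
    return -1;
  }
  return 0;
}
```

In the wrapper this would mean calling `getppid()` as the very first statement of `main`, passing the result to `arm_parent_death_signal(parent, SIGHUP)`, and returning early when it reports -1.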
> Although it's a fringe issue, since the BEAM is normally stopped only
> during development or when deploying new code, I feel it would be useful
> to bundle this behavior into Elixir or Erlang itself.  It could be seen as
> an elegant extension of the OTP supervision tree principle beyond the VM
> boundary.  It also seems to have some real-world consequences, and it's an
> obscure problem for an application developer to solve from scratch each
> time.
>
> One more point: such a behavior should probably be made optional, maybe
> as a flag to Port.  I would imagine the default should be to use the
> wrapper, so the implicit default might look like `allow_orphan: false`.
> The rare case where we omit the wrapper would be when it's more useful to
> let the child continue than to maintain control over it.
>
> Kind regards,
> Adam Wight
>
> --
> You received this message because you are subscribed to the Google Groups
> "elixir-lang-core" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elixir-lang-core+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com
> <https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

