Re: Weird possibility with async processes, $!, and long running scripts

Robert Elz Mon, 16 Mar 2020 08:08:24 -0700

    Date:        Mon, 16 Mar 2020 13:38:58 +0100
    From:        Joerg Schilling <joerg.schill...@fokus.fraunhofer.de>
    Message-ID:  <5e6f7362.u+rw3m3sirjpta0s%joerg.schill...@fokus.fraunhofer.de>


  | Do you like to talk about what happens when pid numbers are reused?

Yes.

  | This may be a negative side-effect of PID randomization that could reuse
  | pid numbers much earlier than without...

It makes no difference to the issue, though it may alter the probability
of it occurring.

  | > Does anyone know of a shell that correctly handles this now?
  |
  | I guess there is no behavior that could be called "correct",
  | since the behavior is not caused by the shell but by the kernel.

No, it is the shell causing this one, not the kernel - the kernel (or
at least, every kernel I'm aware of, from 5th edition (mid 1970's) to now)
avoids reassigning a pid that still belongs to an existing process - there
are never duplicates.

However the shell keeps jobs in its jobs table until a script does a
wait command - the shell is keeping the pid "alive" longer than it remains
alive in the kernel (and so protected from reuse).   That's the problem.
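
A minimal sketch of the kind of script this affects (the job body and the
variable names are illustrative only, not taken from any real script):

```shell
#!/bin/sh
# Start an asynchronous job and remember its pid via $!.
( exit 7 ) &
job_pid=$!

# ... the script does other, possibly very long running, work here.
# The child may exit in the meantime. Once the kernel no longer holds
# the process, the number can be handed to an unrelated process while
# the shell still lists it in its jobs table.

# Much later: collect the job's status using the saved pid.
wait "$job_pid"
job_status=$?
echo "job $job_pid exited with status $job_status"
```

Nothing goes wrong in a run this short; the window only opens when enough
processes are created between the "&" and the "wait" for the pid to be
reused.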

That is, this is a common shell implementation technique. The problem is
incredibly hard to test without a custom kernel to force the issue (very
few available pids), so I haven't attempted to discover which shells, if
any, have any mitigation for this (or even avoid it completely).

That's why I asked.

  | Before talking about your ideas, it would be important to define what you 
  | intend.

Correctness.   Making sure that when a script evaluates $! it gets a handle
on a job that it can (hours, or days, or weeks later) reference to wait for
(or kill) the job, and know it is referencing the correct job.   Always.

  | Solaris defines PID_MAX to 999999. What value are you using?

Irrelevant.   Modern processors can run through that many processes
in almost no time - the issue is that pids are reused, eventually,
in all systems.

  | In fact, there are platforms (AIX IIRC) that implement waitid()
  | flags only with waitid(),

OK, not that I care a lot about AIX

  | but why do you care about outdated interfaces anyway?

I was anticipating a comment like that - but this is irrelevant. If we
use the WNOWAIT method, and can somehow make it work, it is up to the
implementation to make that function on the system it is to be installed
upon. If that means using waitid() then that's what it would have to do.

And from your other message...

  | Then you would need to rewrite the shell to behave like ksh93 and install a
  | SIGCLD handler. I am not sure about possible side effects.... 

Depending what the handler does (what action the shell takes when it
receives a SIGCHLD - please spell it correctly, the list of signal names,
including SIGCHLD, is in XBD, see page 334 - regardless of whether that
happens in the signal handler, or in some code run later that is triggered
by the signal handler) that can only ever make the problem worse (or be
neutral), though I suppose in conjunction with never doing waitid(P_ALL, ...)
or the equivalent using one of the other wait*() interfaces, there might
be a method there which could work (keep the zombie in the kernel, yet
be aware that it is a zombie, and why it exited).


Harald's suggestion (and my comment about it) "works" (as much as it does)
as it would avoid the shell using pids (which may have been reused by the
kernel) to refer to shell jobs, and instead use job designators (%1 %% etc)
which are totally under control of the shell, and so can be made safe.
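
A rough sketch of what that looks like in a script, assuming the shell
accepts job designators as operands to "wait" when job control is not
enabled (the job body and variable name are made up for illustration):

```shell
#!/bin/sh
# Start one asynchronous job; being the first, the shell numbers it %1.
( sleep 1; exit 3 ) &

# Refer to the job by its designator rather than by the pid from $!.
# The job number is allocated by the shell, so the kernel cannot hand
# it to an unrelated process.
wait %1
designator_status=$?
echo "job %1 exited with status $designator_status"
```

The sleep is only there so the job is still running when the wait is
issued; how long a shell keeps a finished job addressable by its
designator is a separate question.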

Pity that "ps -p %%" doesn't work though...

kre
