Re: wait -n misses signaled subprocess

2024-02-08 Thread Chet Ramey

On 1/31/24 2:35 PM, Robert Elz wrote:


   | Not quite. `new' in this sense is the opposite of `anything in the past'
   | as Dale described it -- already notified and removed from the jobs list.

> I guess the part about bash that I am not understanding here is how the
> "already notified" works.   To me there are just two ways for that, either
> the user has done a "wait" which has collected that pid already (either
> without -n, and no pid args, or with pid args and one of those is the pid
> in question) or with -n and the pid in question was the one whose status
> was returned, or the user/script did the jobs command (or jobs -l) and the
> job in question was shown as completed.
>
> Is there some other way?


Notification after a job terminates due to a signal in a non-interactive
shell -- that runs the equivalent of `jobs'. As it turns out, this was the
problem with Steven Pelley's original report. I fixed one issue, but that
kind of notification will leave jobs marked as notified and eligible to
be removed from the jobs list.
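
To make the signal case concrete, here is a minimal sketch (not the script
from the original report) of a non-interactive script whose child is killed
by a signal. It relies only on the usual convention that death by signal is
reported as 128 plus the signal number, and it uses `wait' with an explicit
pid, which per this thread still finds the status after the job has been
marked as notified. The variable names are illustrative.

```bash
#!/usr/bin/env bash
# Start a background child and terminate it with SIGTERM.
sleep 30 &
pid=$!
kill -TERM "$pid"

# wait with an explicit pid returns the saved status even if the
# child is already dead by the time we ask.
wait "$pid"
status=$?

# A status above 128 means "killed by signal (status - 128)".
if [ "$status" -gt 128 ]; then
    echo "child $pid died from signal $((status - 128))"
fi
```

Bash may also print its own "Terminated" notice on stderr here; that notice
is the kind of notification discussed above.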



   | Half the problem here is that bash aggressively marks dead jobs as being
   | notified in non-interactive shells without job control enabled, and moves
   | them out of the jobs table.

> That might be more than half the problem, it might be the entire problem.


It seems to be in this case. It's a good thing it's limited to processes
that terminate due to signals; a bad thing that processes terminating due
to signals was the entire subject of the original report.



   | but if you
   | do, or if you use wait -n with pid/job arguments (which you've presumably
   | saved yourself) you're going to need slightly different semantics than we
   | have now to answer that reliably. And that will probably need a new option.

> That's a pity, particularly since the current semantics don't seem to
> be useful in general.


Shoehorning pid/job arguments into the previous behavior, which only dealt
with running jobs, resulted in the current semantics. I should probably
have made `wait -n' with pid arguments look at terminated and notified
processes, but I didn't change the `running job' semantics. Hindsight.


> Since the sole issue provoking that seems to be
> the wait over and over policy,


It's not a policy, per se, it's behavior that has historically worked that
way.


> rather than "wait once, and remove completely"


POSIX semantics.


> perhaps rather than a new, but different, -n like option, a better idea would
> be an "only once" option (ie: if the option (-r (remove) or -c (cleanup) or -o
> (once only)) is set, then when the wait with that option returns status,
> or waits until termination without returning status (in the not -n case, with
> no pid args, or many pid args) then the processes are completely deleted from
> everywhere in the shell).


Or you could use posix mode with the recent change, already in devel, since
POSIX requires this behavior (but see below).


> Using that option would make a changed -n safe
> to use in loops.   If you do that, also add an option (maybe the upper case
> version of whatever is selected for that one, or just some other letter) to
> mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash would
> just be a no-op (except in posix mode, apparently - whereas the -[cor] option
> would be a no-op in posix mode).


You're not the only one to suggest some new option(s). Only one really
matters for this discussion.



> If you were to do that, other shells could add the same (except in probably
> all of them, -[cor] would always be the default, and the other one would be
> the one which changes behaviour).


That's always hit or miss.



   | > The one change that should be made is
   | > to allow wait -n to collect processes/jobs that have already terminated.
   |
   | Yes, that's one of the things we're talking about. I don't have any problem
   | with it, but should it take a new option to change those semantics?

> Good, though I think some more thought should go into that.   In another
> thread you said (paraphrasing) correctly, that scripts should not be
> relying upon bugs, and the current wait -n behaviour is a bug - that it
> might have been intentionally coded that way doesn't make it any less so.


Trust me, there are people on the other side of that question.


> It isn't as if it was ever documented to work the way it does, or everyone
> would have known about it already.


You mean the behavior of `wait -n' with pid arguments, I presume. The
problem with your statement is that people do know about it. The question,
as above, is whether or not to avoid changing the behavior because they do.

There are two things that we could change:

1. wait -n needs to get access to the list of terminated pids (the ones
   that satisfy POSIX's "CHILD_MAX processes known in the current shell
   environment"), like wait without -n does. This can happen via a wait
   option, a shell option, or a change in behavior controlled by the
   compatibility level.

2. Some option to implement the 

Re: wait -n misses signaled subprocess

2024-02-01 Thread alex xmb sw ratchev
On Thu, Feb 1, 2024, 09:09 alex xmb sw ratchev  wrote:

>
>
> On Wed, Jan 31, 2024, 20:36 Robert Elz  wrote:
>
>> Date: Wed, 31 Jan 2024 11:35:57 -0500
>> From: Chet Ramey
>> Message-ID:  <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu>
>>
>>   | Not quite. `new' in this sense is the opposite of `anything in the
>> past'
>>   | as Dale described it -- already notified and removed from the jobs
>> list.
>>
>> I guess the part about bash that I am not understanding here is how the
>> "already notified" works.   To me there are just two ways for that, either
>> the user has done a "wait" which has collected that pid already (either
>> without -n, and no pid args, or with pid args and one of those is the pid
>> in question) or with -n and the pid in question was the one whose status
>> was returned, or the user/script did the jobs command (or jobs -l) and the
>> job in question was shown as completed.
>>
>
> i say additional datastructure for the saving purpose ..
>

it d need new uid , real-unique-id , or some special hash of the
jobs/pids/cmdlines

Is there some other way?
>>
>>   | Half the problem here is that bash aggressively marks dead jobs as
>> being
>>   | notified in non-interactive shells without job control enabled, and
>> moves
>>   | them out of the jobs table.
>>
>> That might be more than half the problem, it might be the entire problem.
>>
>>   | If you use wait -n without arguments, you probably don't care,
>>
>> No you do, that just means any of the children ... the script could make
>> a list of all of them and supply that list, but if the list is just going
>> to contain all the existing children, why bother?   (With -n - and not
>> exactly one pid arg, -p is generally going to be required, but that option
>> has no bearing on which process is selected, or might be, which is the
>> issue here).
>>
>>   | but if you
>>   | do, or if you use wait -n with pid/job arguments (which you've
>> presumably
>>   | saved yourself) you're going to need slightly different semantics
>> than we
>>   | have now to answer that reliably. And that will probably need a new
>> option.
>>
>> That's a pity, particularly since the current semantics don't seem to
>> be useful in general.   Since the sole issue provoking that seems to be
>> the wait over and over policy, rather than "wait once, and remove
>> completely"
>> perhaps rather than a new, but different, -n like option, a better idea
>> would
>> be a "only once" option (ie: if the option (-r (remove) or -c (cleanup)
>> or -o
>> (once only)) is set, then when the wait with that option returns status
>> or,
>> or waits until termination without returning status (in the not -n case,
>> with
>> no pid args, or many pid args) then the processes are completely deleted
>> from
>> everywhere in the shell.   Using that option would make a changed -n safe
>> to use in loops.   If you do that, also add an option (maybe the upper
>> case
>> version of whatever is selected for that one, or just some other letter)
>> to
>> mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash
>> would
>> just be a no-op (except in posix mode, apparently - whereas the -[cor]
>> option
>> would be a no-op in posix mode).
>>
>> If you were to do that, other shells could add the same (except in
>> probably
>> all of them, -[cor] would always be the default, and the other one would
>> be
>> the one which changes behaviour).
>>
>>   | And that's why I used `more': there are several differences, so which
>>   | of those differences should we attempt to change?
>>
>> Just the one.
>>
>>   | > The one change that should be made is
>>   | > to allow wait -n to collect processes/jobs that have already
>> terminated.
>>   |
>>   | Yes, that's one of the things we're talking about. I don't have any
>> problem
>>   | with it, but should it take a new option to change those semantics?
>>
>> Good, though I think some more thought should go into that.   In another
>> thread you said (paraphrasing) correctly, that scripts should not be
>> relying upon bugs, and the current wait -n behaviour is a bug - that it
>> might have been intentionally coded that way doesn't make it any less so.
>> It isn't as if it was ever documented to work the way it does, or everyone
>> would have known about it already.
>>
>>   | > Changing it to wait for all the listed pids
>>   | It's never done that.
>>   | We're not going to change the return value from wait.
>>
>> Good, I only mentioned those possibilities because your earlier
>> message was unclear about what "more like wait without -n" meant.
>>
>>   | Yeah, but we're talking about bash here. It doesn't really matter what
>>   | the Bourne shell did; there are likely plenty of scripts that assume
>>   | the historical bash behavior.
>>
>> Really?   Why?   What's the point of collecting the status twice?
>> It can't change in the meantime can it, once a process has done exit(N)
>> its exit status should always be N, 

Re: wait -n misses signaled subprocess

2024-02-01 Thread alex xmb sw ratchev
On Wed, Jan 31, 2024, 20:36 Robert Elz  wrote:

> Date: Wed, 31 Jan 2024 11:35:57 -0500
> From: Chet Ramey
> Message-ID:  <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu>
>
>   | Not quite. `new' in this sense is the opposite of `anything in the
> past'
>   | as Dale described it -- already notified and removed from the jobs
> list.
>
> I guess the part about bash that I am not understanding here is how the
> "already notified" works.   To me there are just two ways for that, either
> the user has done a "wait" which has collected that pid already (either
> without -n, and no pid args, or with pid args and one of those is the pid
> in question) or with -n and the pid in question was the one whose status
> was returned, or the user/script did the jobs command (or jobs -l) and the
> job in question was shown as completed.
>

i say additional datastructure for the saving purpose ..

Is there some other way?
>
>   | Half the problem here is that bash aggressively marks dead jobs as
> being
>   | notified in non-interactive shells without job control enabled, and
> moves
>   | them out of the jobs table.
>
> That might be more than half the problem, it might be the entire problem.
>
>   | If you use wait -n without arguments, you probably don't care,
>
> No you do, that just means any of the children ... the script could make
> a list of all of them and supply that list, but if the list is just going
> to contain all the existing children, why bother?   (With -n - and not
> exactly one pid arg, -p is generally going to be required, but that option
> has no bearing on which process is selected, or might be, which is the
> issue here).
>
>   | but if you
>   | do, or if you use wait -n with pid/job arguments (which you've
> presumably
>   | saved yourself) you're going to need slightly different semantics than
> we
>   | have now to answer that reliably. And that will probably need a new
> option.
>
> That's a pity, particularly since the current semantics don't seem to
> be useful in general.   Since the sole issue provoking that seems to be
> the wait over and over policy, rather than "wait once, and remove
> completely"
> perhaps rather than a new, but different, -n like option, a better idea
> would
> be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) or
> -o
> (once only)) is set, then when the wait with that option returns status or,
> or waits until termination without returning status (in the not -n case,
> with
> no pid args, or many pid args) then the processes are completely deleted
> from
> everywhere in the shell.   Using that option would make a changed -n safe
> to use in loops.   If you do that, also add an option (maybe the upper case
> version of whatever is selected for that one, or just some other letter) to
> mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash
> would
> just be a no-op (except in posix mode, apparently - whereas the -[cor]
> option
> would be a no-op in posix mode).
>
> If you were to do that, other shells could add the same (except in probably
> all of them, -[cor] would always be the default, and the other one would be
> the one which changes behaviour).
>
>   | And that's why I used `more': there are several differences, so which
>   | of those differences should we attempt to change?
>
> Just the one.
>
>   | > The one change that should be made is
>   | > to allow wait -n to collect processes/jobs that have already
> terminated.
>   |
>   | Yes, that's one of the things we're talking about. I don't have any
> problem
>   | with it, but should it take a new option to change those semantics?
>
> Good, though I think some more thought should go into that.   In another
> thread you said (paraphrasing) correctly, that scripts should not be
> relying upon bugs, and the current wait -n behaviour is a bug - that it
> might have been intentionally coded that way doesn't make it any less so.
> It isn't as if it was ever documented to work the way it does, or everyone
> would have known about it already.
>
>   | > Changing it to wait for all the listed pids
>   | It's never done that.
>   | We're not going to change the return value from wait.
>
> Good, I only mentioned those possibilities because your earlier
> message was unclear about what "more like wait without -n" meant.
>
>   | Yeah, but we're talking about bash here. It doesn't really matter what
>   | the Bourne shell did; there are likely plenty of scripts that assume
>   | the historical bash behavior.
>
> Really?   Why?   What's the point of collecting the status twice?
> It can't change in the meantime can it, once a process has done exit(N)
> its exit status should always be N, regardless of how often it is waited
> upon.
>
> [Aside: this should be obvious, but when one is collecting status changes,
> rather than just "terminated" status, then the pid isn't removed if it
> returns a "stopped" or "continued" status.]
>
>   | > I meant the distinction 

Re: wait -n misses signaled subprocess

2024-01-31 Thread Robert Elz
Date: Wed, 31 Jan 2024 11:35:57 -0500
From: Chet Ramey
Message-ID:  <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu>

  | Not quite. `new' in this sense is the opposite of `anything in the past'
  | as Dale described it -- already notified and removed from the jobs list.

I guess the part about bash that I am not understanding here is how the
"already notified" works.   To me there are just two ways for that, either
the user has done a "wait" which has collected that pid already (either
without -n, and no pid args, or with pid args and one of those is the pid
in question) or with -n and the pid in question was the one whose status
was returned, or the user/script did the jobs command (or jobs -l) and the
job in question was shown as completed.

Is there some other way?

  | Half the problem here is that bash aggressively marks dead jobs as being
  | notified in non-interactive shells without job control enabled, and moves
  | them out of the jobs table.

That might be more than half the problem, it might be the entire problem.

  | If you use wait -n without arguments, you probably don't care,

No you do, that just means any of the children ... the script could make
a list of all of them and supply that list, but if the list is just going
to contain all the existing children, why bother?   (With -n - and not
exactly one pid arg, -p is generally going to be required, but that option
has no bearing on which process is selected, or might be, which is the
issue here).

  | but if you
  | do, or if you use wait -n with pid/job arguments (which you've presumably
  | saved yourself) you're going to need slightly different semantics than we
  | have now to answer that reliably. And that will probably need a new option.

That's a pity, particularly since the current semantics don't seem to
be useful in general.   Since the sole issue provoking that seems to be
the wait over and over policy, rather than "wait once, and remove completely"
perhaps rather than a new, but different, -n like option, a better idea would
be an "only once" option (ie: if the option (-r (remove) or -c (cleanup) or -o
(once only)) is set, then when the wait with that option returns status,
or waits until termination without returning status (in the not -n case, with
no pid args, or many pid args) then the processes are completely deleted from
everywhere in the shell).   Using that option would make a changed -n safe
to use in loops.   If you do that, also add an option (maybe the upper case
version of whatever is selected for that one, or just some other letter) to
mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash would
just be a no-op (except in posix mode, apparently - whereas the -[cor] option
would be a no-op in posix mode).

If you were to do that, other shells could add the same (except in probably
all of them, -[cor] would always be the default, and the other one would be
the one which changes behaviour).

  | And that's why I used `more': there are several differences, so which
  | of those differences should we attempt to change?

Just the one.

  | > The one change that should be made is
  | > to allow wait -n to collect processes/jobs that have already terminated.
  |
  | Yes, that's one of the things we're talking about. I don't have any problem
  | with it, but should it take a new option to change those semantics?

Good, though I think some more thought should go into that.   In another
thread you said (paraphrasing) correctly, that scripts should not be
relying upon bugs, and the current wait -n behaviour is a bug - that it
might have been intentionally coded that way doesn't make it any less so.
It isn't as if it was ever documented to work the way it does, or everyone
would have known about it already.

  | > Changing it to wait for all the listed pids
  | It's never done that.
  | We're not going to change the return value from wait.

Good, I only mentioned those possibilities because your earlier
message was unclear about what "more like wait without -n" meant.

  | Yeah, but we're talking about bash here. It doesn't really matter what
  | the Bourne shell did; there are likely plenty of scripts that assume
  | the historical bash behavior.

Really?   Why?   What's the point of collecting the status twice?
It can't change in the meantime can it, once a process has done exit(N)
its exit status should always be N, regardless of how often it is waited
upon.

[Aside: this should be obvious, but when one is collecting status changes,
rather than just "terminated" status, then the pid isn't removed if it
returns a "stopped" or "continued" status.]

  | > I meant the distinction between processes
  | > that the shell has already collected status for, and those for which it

  | You're not the first to propose something like that, but I'm not going to
  | be writing that code any time soon.

Nor am I, if you go back to the message where I first mentioned it,
which I can't locate 

Re: wait -n misses signaled subprocess

2024-01-31 Thread Chet Ramey

On 1/30/24 12:40 PM, Robert Elz wrote:


   | since this was the way -n worked originally, before it started
   | paying attention to pid arguments.

> I'm not sure what the "this" is there, if you meant as I described it
> in my answer to your rhetorical question, viz:
>
>     Find, or if there are none already, wait*(2) for, [...]
>     If there's already a terminated job [...] then no wait type
>     sys call gets performed
>
> then that seems to be in conflict with some of your other statements
> like:


I won't ask you to look at the code, but yes, that's pretty much what it
did: polled dead jobs to see if any could be returned because the user
had not been notified, then made sure there were actual running background
jobs and waited for one of them and returned the first one that exited.



> chet.ra...@case.edu said (replying to Dale R. Worley):
   | > It looks like the underlying meaning of "-n" is to only pay attention to
   | > *new* job completions, and anything "in the past" (already notified and
   | > moved to the table of terminated background jobs) is ignored.
   | That was the original implementation, yes.

> which is a different thing entirely.


Not quite. `new' in this sense is the opposite of `anything in the past'
as Dale described it -- already notified and removed from the jobs list.
Jobs in the jobs list that hadn't been marked as notified were eligible
to be returned, because to the user, they're new.

Half the problem here is that bash aggressively marks dead jobs as being
notified in non-interactive shells without job control enabled, and moves
them out of the jobs table.



   | Right -- it works on the list of running background jobs.

> I know it is hard, but for determining what should happen, we need to
> keep thoughts of the current implementation details out of this, as
> while I'm sure you know exactly what that means, most others will not.


It's pretty much the original implementation as I described it above. The
running background jobs part kicks in after the `dead but not notified'
part.



> What matters (to a script writer) is whether or not the processes listed
> (if any) have had their status collected before or not - if not, then
> any process (job) eligible (in the arg list of pids if there is one, or
> just any) which has returned some status should be returned (if there
> are multiple, any one of them) and if there are none, then we wait(2)
> until one does change status.   What exactly "Running background jobs"
> means there is not clear (to me anyway).


OK.


> What's the mechanism by
> which they find out which processes are in the state where the current version
> of wait -n will work on them?   Assume there are multiple running (or
> perhaps recently ended) processes, and we want to process each as it
> ends (or soon after, given multiple might end around the same time).


If you use wait -n without arguments, you probably don't care, but if you
do, or if you use wait -n with pid/job arguments (which you've presumably
saved yourself) you're going to need slightly different semantics than we
have now to answer that reliably. And that will probably need a new option.



   | The real question is whether or not
   | we should extend `wait -n' to behave more like `wait' without options.

> That's not an answerable question, as there are several differences
> between wait -n and wait without -n (which is what I assume you mean
> by "wait without options").


The bash/posix semantics for `wait' without -n, for which you can ignore -p
and -f.

And that's why I used `more': there are several differences, so which
of those differences should we attempt to change?



> The one change that should be made is
> to allow wait -n to collect processes/jobs that have already terminated.


Yes, that's one of the things we're talking about. I don't have any problem
with it, but should it take a new option to change those semantics?


> Changing it to wait for all the listed pids (which would make it behave
> more like wait without -n) is not desirable.


It's never done that.


> Nor is changing a simple
> "wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant)
> to always exit with status 0 - which is what "wait" does.   So, please
> be clear.


We're not going to change the return value from wait.



   | Why impose that requirement when it's never existed before?

> Never existed before in what?   In bash, perhaps.   In standard Bourne
> shells (and POSIX), this isn't at all new, it has always been required
> to wait for background processes (or allow the list of saved status
> to overflow, and old ones to be discarded).


Yeah, but we're talking about bash here. It doesn't really matter what
the Bourne shell did; there are likely plenty of scripts that assume
the historical bash behavior.


> There was never any
> implicit "clean up when X happens" which is what bash seems to do
> (in non-interactive shells, interactive ones clean up before PS1 is
> written).


And?



   | Bash `wait' already has -f to return only 

Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/30/24 4:28 PM, Chet Ramey wrote:


> It's not a bug, bash has allowed multiple waits for the same pid for
> decades. bash works the way posix says it should for wait (without -n)
> in posix mode.


I think this is a bug in bash posix mode, actually. `wait -n' should
remove the job completely, since it's been `successfully waited for'
and the language you quoted came out of interp 1254 and will be in the
next revision.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/


Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/30/24 2:30 PM, Robert Elz wrote:


   | If wait -n
   | looked at terminated processes you'd return jobs repeatedly and
   | possibly end up in an infinite loop.

> That's another bash bug, POSIX says:


It's not a bug, bash has allowed multiple waits for the same pid for
decades. bash works the way posix says it should for wait (without -n)
in posix mode.



> With wait -n, the shell should look to see if any of the process id's
> listed is currently terminated, and if so, return status of one of those
> (and remove it from the lists).   If none are terminated, it should look
> to see if any of the pids are for non-terminated jobs (or processes) and
> if so, just do a wait() until some child changes status.   If that one is
> one that is in the list being waited for, then return its status (and remove
> it from the lists) otherwise just change the status of that process in the
> lists (including remembering the exit status if that is what this was), and
> wait() again - eventually one of them should change status (that or the
> shell will be interrupted by a signal, ending the wait utility).   If none
> of the pids given in the arg list are known to the shell then it should
> return 127.


We can have these different semantics with a new option.


Re: wait -n misses signaled subprocess

2024-01-30 Thread Robert Elz
Date: Tue, 30 Jan 2024 10:14:10 -0500
From: Steven Pelley
Message-ID:  


  | If wait -n
  | looked at terminated processes you'd return jobs repeatedly and
  | possibly end up in an infinite loop.

That's another bash bug, POSIX says:

Once a process ID that is known in the current shell execution
environment (see Section 2.13, on page 2522) has been successfully
waited for, it shall be removed from the list of process IDs that
are known in the current shell execution environment. If the process
ID is associated with a background job, the corresponding job shall
also be removed from the list of background jobs.

That is, if you wait for the same pid again, then all you can get is a
127 status (that pid is not known, or should not be).
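
A small sketch of the difference being described, with illustrative variable
names. The first wait reports the child's exit status. What the second one
reports depends on the shell and its mode: 127 under the POSIX rule quoted
above, or the remembered status again under bash's historical default
behaviour. The script therefore just prints the second value rather than
assuming either outcome.

```bash
#!/usr/bin/env bash
(exit 7) &
pid=$!

wait "$pid"; first=$?    # the child's real exit status
wait "$pid"; second=$?   # 127 if the pid was forgotten, else the saved status

echo "first wait:  $first"
echo "second wait: $second"
```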

With wait -n, the shell should look to see if any of the process id's
listed is currently terminated, and if so, return status of one of those
(and remove it from the lists).   If none are terminated, it should look
to see if any of the pids are for non-terminated jobs (or processes) and
if so, just do a wait() until some child changes status.   If that one is
one that is in the list being waited for, then return its status (and remove
it from the lists) otherwise just change the status of that process in the
lists (including remembering the exit status if that is what this was), and
wait() again - eventually one of them should change status (that or the
shell will be interrupted by a signal, ending the wait utility).   If none
of the pids given in the arg list are known to the shell then it should
return 127.
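
The semantics described above can be approximated in script form today. The
following is only a sketch under stated assumptions: wait_any and WAITED_PID
are hypothetical names, not bash features; every pid passed in is assumed to
be a child of the current shell; and the helper polls with kill -0 and a
short sleep where a real implementation would block in wait(2). It leans on
`wait' with an explicit pid, which does look at already terminated children.

```bash
#!/usr/bin/env bash
# wait_any PID...  (hypothetical helper, not a bash builtin)
# Returns the exit status of the first listed child found to be
# terminated and stores its pid in WAITED_PID.
wait_any() {
    local pid
    [ "$#" -gt 0 ] || return 127
    while :; do
        for pid in "$@"; do
            # kill -0 fails once the child is gone and has been reaped.
            if ! kill -0 "$pid" 2>/dev/null; then
                WAITED_PID=$pid
                wait "$pid"     # status comes from the shell's saved list
                return "$?"
            fi
        done
        sleep 0.1               # poll; this sketch does not block in wait(2)
    done
}

(exit 3) & a=$!
(sleep 1; exit 4) & b=$!
sleep 0.3                       # let the first child terminate before we look

wait_any "$a" "$b"; first=$?; first_pid=$WAITED_PID
wait_any "$b"; second=$?
echo "pid $first_pid exited with $first, then pid $b with $second"
```

The caller drops each returned pid from its own list, which is what keeps a
loop built on this helper from seeing the same child twice.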

Do that, properly, and the loop will always terminate, whether or not
you remove each pid from the list of pending ones as its status is returned.

bash's habit of holding these things forever is weird, but certainly explains
some of Chet's concerns with list sizes and such.


Incidentally, the example code given is not a good example of the issue.
In that, if the first background sleep is allowed to finish before the
wait -n loop starts, bash still returns its status (achieve that by making
the sleeps be for longer, except the first, then add a (fg) sleep 2 into
the script before the loop starts).   Whatever condition is required to
trigger the behaviour that is being objected to doesn't occur in that case.

kre




Re: wait -n misses signaled subprocess

2024-01-30 Thread Robert Elz
Date: Tue, 30 Jan 2024 09:16:47 -0500
From: Chet Ramey
Message-ID:  <95841ed3-ec4f-4b17-802c-86e560b58...@case.edu>

  | since this was the way -n worked originally, before it started
  | paying attention to pid arguments.

I'm not sure what the "this" is there, if you meant as I described it
in my answer to your rhetorical question, viz:

Find, or if there are none already, wait*(2) for, [...]
If there's already a terminated job [...] then no wait type
sys call gets performed

then that seems to be in conflict with some of your other statements
like:

chet.ra...@case.edu said (replying to Dale R. Worley):
  | > It looks like the underlying meaning of "-n" is to only pay attention to
  | > *new* job completions, and anything "in the past" (already notified and
  | > moved to the table of terminated background jobs) is ignored.
  | That was the original implementation, yes.

which is a different thing entirely.

  | Right -- it works on the list of running background jobs.

I know it is hard, but for determining what should happen, we need to
keep thoughts of the current implementation details out of this, as
while I'm sure you know exactly what that means, most others will not.

What matters (to a script writer) is whether or not the processes listed
(if any) have had their status collected before or not - if not, then
any process (job) eligible (in the arg list of pids if there is one, or
just any) which has returned some status should be returned (if there
are multiple, any one of them) and if there are none, then we wait(2)
until one does change status.   What exactly "Running background jobs"
means there is not clear (to me anyway).

But if it were to mean only processes that haven't previously terminated,
how is the script writer meant to handle that?   What's the mechanism by
which they find out which processes are in the state where the current version
of wait -n will work on them?   Assume there are multiple running (or
perhaps recently ended) processes, and we want to process each as it
ends (or soon after, given multiple might end around the same time).
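
For reference, the kind of loop in question looks roughly like the sketch
below. It uses bare `wait -n' (no pid arguments) and only exercises the
uncontroversial case, where each child is still running when wait -n is
called; the sleeps, exit statuses and the iteration cap are illustrative
choices, not taken from the original report.

```bash
#!/usr/bin/env bash
# Three children that finish at different times with distinct statuses.
(sleep 0.2; exit 1) &
(sleep 0.6; exit 2) &
(sleep 1.0; exit 3) &

collected=()
# The iteration cap is a safety net for this sketch, not part of the idiom.
for ((i = 0; i < 10; i++)); do
    wait -n
    status=$?
    # 127 means the shell has no unwaited-for children left.
    [ "$status" -eq 127 ] && break
    collected+=("$status")
done

echo "statuses in completion order: ${collected[*]}"
```

A child that itself exits with 127 is indistinguishable from "no children
left" in this form, and the loop has no way of noticing a child that
terminated but was never returned by wait -n, which is the case this thread
is about.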

  | The real question is whether or not
  | we should extend `wait -n' to behave more like `wait' without options.

That's not an answerable question, as there are several differences
between wait -n and wait without -n (which is what I assume you mean
by "wait without options").   The one change that should be made is
to allow wait -n to collect processes/jobs that have already terminated.
Changing it to wait for all the listed pids (which would make it behave
more like wait without -n) is not desirable.   Nor is changing a simple
"wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant)
to always exit with status 0 - which is what "wait" does.   So, please
be clear.

  | Why impose that requirement when it's never existed before?

Never existed before in what?   In bash, perhaps.   In standard Bourne
shells (and POSIX), this isn't at all new, it has always been required
to wait for background processes (or allow the list of saved status
to overflow, and old ones to be discarded).   There was never any
implicit "clean up when X happens" which is what bash seems to do
(in non-interactive shells, interactive ones clean up before PS1 is
written).

  | Bash `wait' already has -f to return only when the specified job(s) has
  | terminated, reserving -t for some future use.

No, that's what I meant, -f is making the distinction between terminated
and some other status change.   I meant the distinction between processes
that the shell has already collected status for, and those for which it
is yet to do so - ie: to add an option more or less equiv to WNOHANG in
the wait*(2) sys calls (the ones that have flags).   The shell could
simply never do a wait(2) family sys call when the option is set, or if
it does one, to see if there might be a zombie waiting to be reaped,
then it should set WNOHANG when it does, to avoid the script from pausing.
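
Until such an option exists, a script can approximate the non-blocking check described above. The following is only a sketch: it assumes the shell reaps children promptly, so that `kill -0` starts failing once the child is gone, and it ignores the (unlikely) case of the pid being reused. The names `try_reap`, `reaped_status` and `polls` are hypothetical, not bash features.

```bash
#!/usr/bin/env bash
# Hypothetical helper: collect the status of a background pid only if it
# has already terminated. On success it sets reaped_status and returns 0;
# it returns 1 (without blocking) while the process still exists.
try_reap() {
    local pid=$1
    if kill -0 "$pid" 2>/dev/null; then
        return 1        # still running, or a zombie the shell has not reaped yet
    fi
    reaped_status=0
    wait "$pid" 2>/dev/null || reaped_status=$?   # plain wait consults the saved statuses
    return 0
}

{ sleep 1; exit 7; } &
pid=$!

polls=0
until try_reap "$pid"; do
    polls=$((polls + 1))
    sleep 1             # stand-in for other useful work
done
echo "pid $pid exited with status $reaped_status after $polls poll(s)"
```

The design relies on plain `wait pid` returning a saved status without pausing, which is the behaviour the thread agrees on; only the `kill -0` probe is a guess about liveness.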

  | There's no reason to keep thousands of terminated jobs in the jobs list,
  | slowing everything down, as long as you give users a way to retrieve their
  | status.

This is just implementation detail, as long as it behaves correctly,
what optimisations the implementation chooses to make are irrelevant.

  | You can run thousands of background jobs in a loop without exceeding the
  | max process limit.

It depends just what those jobs are.   For something like

while true; do :& done

then yes, sure as the jobs all terminate quite quickly, and as the shell
collects the zombies as soon as they become available (more or less) the
limit never gets reached.

But those kinds of things are rarely useful to anyone except those doing
torture tests.

More likely would be something like

while true; do sleep 1 & done

where the "sleep" is just a placeholder for anything meaningful which is
going to take appreciable time to complete.  In 

Re: wait -n misses signaled subprocess

2024-01-30 Thread Steven Pelley
Apologies for a typo:
With the discussed change this would return 44080: 1 in an endless loop.
1, not 0

On Tue, Jan 30, 2024 at 10:14 AM Steven Pelley  wrote:
>
> > OK. Can you think of a use case that would break if wait -n looked at
> > terminated processes?
>
> Yes.  If one were to start a number of bg jobs and repeatedly send the
> list of pids to wait -n (probably redirecting stderr to /dev/null to
> ignore messages about unknown jobs) today you'd process the jobs one
> at a time, assuming no races between job completion.  If wait -n
> looked at terminated processes you'd return jobs repeatedly and
> possibly end up in an infinite loop.
>
> Example:
> # associate array used for consistency with later example
> declare -A pids
> { sleep 1; exit 1; } &
> pids[$!]=""
> { sleep 2; exit 2; } &
> pids[$!]=""
> { sleep 3; exit 3; } &
> pids[$!]=""
>
> status=0
> while [ $status -ne 127 ]; do
> unset finished_pid
> wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
> status=$?
> if [ -n "$finished_pid" ]; then
> echo "$finished_pid: $status @${SECONDS}"
> fi;
> done
>
> gives a simple output like:
> 44080: 1 @1
> 44081: 2 @2
> 44083: 3 @3
>
> With the discussed change this would return 44080: 0 in an endless loop.
> It would need to change to:
>
> while [ ${#pids[@]} -ne 0 ]; do
> unset finished_pid
> wait -n -p finished_pid "${!pids[@]}"
> status=$?
> if [ -n "$finished_pid" ]; then
> echo "$finished_pid: $status @${SECONDS}"
> fi;
> unset pids[$finished_pid]
> done
>
> Where the returned pid is unset in the array.  I like this more but it
> will break scripts that run correctly today.



Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/30/24 10:14 AM, Steven Pelley wrote:

OK. Can you think of a use case that would break if wait -n looked at
terminated processes?


Yes.  If one were to start a number of bg jobs and repeatedly send the
list of pids to wait -n (probably redirecting stderr to /dev/null to
ignore messages about unknown jobs) today you'd process the jobs one
at a time, assuming no races between job completion.  If wait -n
looked at terminated processes you'd return jobs repeatedly and
possibly end up in an infinite loop.


OK, that argues for a new option to provide this functionality.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/





Re: wait -n misses signaled subprocess

2024-01-30 Thread Steven Pelley
> OK. Can you think of a use case that would break if wait -n looked at
> terminated processes?

Yes.  If one were to start a number of bg jobs and repeatedly send the
list of pids to wait -n (probably redirecting stderr to /dev/null to
ignore messages about unknown jobs) today you'd process the jobs one
at a time, assuming no races between job completion.  If wait -n
looked at terminated processes you'd return jobs repeatedly and
possibly end up in an infinite loop.

Example:
# associative array used for consistency with later example
declare -A pids
{ sleep 1; exit 1; } &
pids[$!]=""
{ sleep 2; exit 2; } &
pids[$!]=""
{ sleep 3; exit 3; } &
pids[$!]=""

status=0
while [ $status -ne 127 ]; do
    unset finished_pid
    wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
    status=$?
    if [ -n "$finished_pid" ]; then
        echo "$finished_pid: $status @${SECONDS}"
    fi
done

gives a simple output like:
44080: 1 @1
44081: 2 @2
44083: 3 @3

With the discussed change this would return 44080: 0 in an endless loop.
It would need to change to:

while [ ${#pids[@]} -ne 0 ]; do
    unset finished_pid
    wait -n -p finished_pid "${!pids[@]}"
    status=$?
    if [ -n "$finished_pid" ]; then
        echo "$finished_pid: $status @${SECONDS}"
        # quoted so the subscript is not subject to pathname expansion
        unset 'pids[$finished_pid]'
    fi
done

Where the returned pid is unset in the array.  I like this more but it
will break scripts that run correctly today.
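
One way to write the loop so that it does not depend on which of the two behaviours `wait -n` has is to always remove a returned pid from the list, and to fall back to plain `wait pid` (which consults the saved statuses) whenever `wait -n` reports nothing. This is a sketch only, assuming bash 5.1 or later for `-p`; the `results` array and `summary` variable are illustrative names.

```bash
#!/usr/bin/env bash
declare -A pids results

for n in 1 2 3; do
    { sleep "$n"; exit "$n"; } &
    pids[$!]=$n
done

while (( ${#pids[@]} > 0 )); do
    finished_pid=
    status=0
    wait -n -p finished_pid "${!pids[@]}" 2>/dev/null || status=$?
    if [[ -n $finished_pid ]]; then
        # wait -n told us which job finished: record it and stop tracking it
        results[$finished_pid]=$status
        unset 'pids[$finished_pid]'
    else
        # wait -n found nothing it was willing to report (for example
        # status 127); collect the remaining pids with plain wait instead
        for pid in "${!pids[@]}"; do
            status=0
            wait "$pid" 2>/dev/null || status=$?
            results[$pid]=$status
            unset 'pids[$pid]'
        done
    fi
done

summary=$(printf '%s\n' "${results[@]}" | sort | tr '\n' ' ')
echo "collected statuses: $summary"
```

Because every pid is removed as soon as its status is recorded, the loop cannot return the same job twice, and the fallback means a job that was already moved out of the jobs list is still accounted for.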



Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/30/24 9:11 AM, Steven Pelley wrote:

It does look in the table of saved exit statuses, returning 1.


It doesn't. In this case, the code path it follows marks the job as dead
but doesn't mark it as notified (since it exited normally), so it's still
in the jobs list when `wait -n' is called, and available for returning.
That's probably a bug there.


Got it.  So wait -n is intended to behave just as the documentation
says -- "next" job -- and if there's a bug it's with how
normally-exiting processes are handled, not signal-exiting processes.
Thank you for your patience.


This has raised several other questions: whether `wait -n' should work
more like `wait' (see below) and whether non-interactive shells without
job control enabled should be so aggressive at marking jobs as notified,
since it's that state that allows them to move to the list of terminated
processes.


There's also an interaction in that "wait" will only look at the
terminated table if "-n" is not specified *and* ids are specified.


This is to maintain POSIX semantics, with extensions. This is one of the
issues -- should `wait -n' with arguments look for terminated processes
in that table, the way `wait' without options does?


Yes, I do want wait -n to look in the terminated table, at least for
my use case responding to jobs finishing, one at a time, as soon as
possible. 


OK. Can you think of a use case that would break if wait -n looked at
terminated processes?



I _don't_ want bash to maintain some sort of internal state about
which jobs have and haven't been returned by wait -n, which would be
complicated and brittle (this is what my mental model was).  I'd want
it to look in  the terminated table for finished jobs amongst the
provided list of pids, and then I'd manage the list of pids myself,
removing pids that were previously returned from wait -n.  This is a
change in semantics and might introduce inconsistencies and difficulty
implementing, I'm just describing what I think would be useful for my
specific needs.


It's not difficult to implement.



Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/29/24 3:49 PM, Robert Elz wrote:

 Date:        Mon, 29 Jan 2024 12:07:53 -0500
 From:        Chet Ramey
 Message-ID:  

   | What does `wait -n' without job arguments mean?

Find, or if there are none already, wait*(2) for, a process (job technically)
that has changed state (terminated in POSIX, and one day in the NetBSD
shell, that difference isn't relevant here) and return its status.
If there's already a terminated job (job which has changed status in bash)
then no wait type sys call gets performed (that already happened).


That was mostly a rhetorical question, since this was the way -n worked
originally, before it started paying attention to pid arguments.

Implicit here is the notification/changed state issue we've been
discussing.



It also returns the status of that process, rather than simple "0" which
a bare "wait" does (and with the appropriate arg, tells you which process
it was).


Not originally, but -p var was a useful addition.



   | OK. Since wait without options can already wait for the same pid multiple
   | times, the -n option has to bring some new functionality here.

Yes, without args, it waits until all listed arg processes (jobs) are
finished (or changed state) and returns the status of the last.   With -n
it waits for any one of them, just as the bash man page says it will.
The "any one" (vs "all") is the new functionality.


Right -- it works on the list of running background jobs.



   | As long as it's still in the jobs list.

Yes, of course - the final para of my message covered that case.

   | OK. We can agree there shouldn't be any difference between `wait pid'
   | and `wait -n pid'.

Yes, but just because that's a degenerate case of the more general commands,
which happens in each case to devolve into the same thing.


Add more `pid' arguments, if you like. The real question is whether or not
we should extend `wait -n' to behave more like `wait' without options.


And from a different message:

chet.ra...@case.edu said:
   | So should the shell require the user to periodically run `wait' in a non-
   | interactive shell without job control to clean dead jobs out of the jobs
   | list? I don't think so.

I do.   wait or jobs ("jobs >/dev/null" is a nice simple clean up, without
the potential hang waiting for things to terminate that the wait utility
imposes). 


Why impose that requirement when it's never existed before? If you want to
do it, go ahead, but we shouldn't be making that a requirement now.

A new option to wait(1) (either a simple one, perhaps -t, to
only wait for already terminated jobs,


Bash `wait' already has -f to return only when the specified job(s) has
terminated, reserving -t for some future use.


Of course, you're also allowed to dump processes from the lists if there
get to be too many of them, but on modern systems, it really should be
possible to retain hundreds, if not thousands, without any real problem.


There's no reason to keep thousands of terminated jobs in the jobs list,
slowing everything down, as long as you give users a way to retrieve their
status.



It's also a bit unusual for non-interactive code to run lots of async jobs
without waiting for results - doing that is a sure way to run into the
"max user processes" limit, and have things start failing.  


You can run thousands of background jobs in a loop without exceeding the
max process limit. People doing that is what got us here in the first
place.

Chet



Re: wait -n misses signaled subprocess

2024-01-30 Thread Steven Pelley
> > It does look in the table of saved exit statuses, returning 1.
>
> It doesn't. In this case, the code path it follows marks the job as dead
> but doesn't mark it as notified (since it exited normally), so it's still
> in the jobs list when `wait -n' is called, and available for returning.
> That's probably a bug there.

Got it.  So wait -n is intended to behave just as the documentation
says -- "next" job -- and if there's a bug it's with how
normally-exiting processes are handled, not signal-exiting processes.
Thank you for your patience.

> > There's also an interaction in that "wait" will only look at the
> > terminated table if "-n" is not specified *and* ids are specified.
>
> This is to maintain POSIX semantics, with extensions. This is one of the
> issues -- should `wait -n' with arguments look for terminated processes
> in that table, the way `wait' without options does?

Yes, I do want wait -n to look in the terminated table, at least for
my use case responding to jobs finishing, one at a time, as soon as
possible.  I don't think wait -n can reliably do this since there is
always a race between a job finishing/being handled, the next job
finishing, and the subsequent call to wait -n.  Even if I query "jobs"
to see if multiple jobs have terminated, the next finishing job could
still race.  You've pointed out clearly that my mental model of wait
-n was wrong so please bear with me if I still don't have this right.

Is there some other best practice for this use case?  It might be "use
a SIGCHLD handler and query jobs to see what jobs have terminated,
then call wait <pid> on each" or "I don't recommend using bash/sh for
this."  Obviously I could also be overlooking some aspect of wait -n
or other bash features that would help here.

I _don't_ want bash to maintain some sort of internal state about
which jobs have and haven't been returned by wait -n, which would be
complicated and brittle (this is what my mental model was).  I'd want
it to look in  the terminated table for finished jobs amongst the
provided list of pids, and then I'd manage the list of pids myself,
removing pids that were previously returned from wait -n.  This is a
change in semantics and might introduce inconsistencies and difficulty
implementing, I'm just describing what I think would be useful for my
specific needs.

A bit of brainstorming: between Linux's pidfds and BSD's
kqueue/process descriptors one ought to be able to build this as an
external command that polls for non-child processes to terminate.  It
couldn't return an exit status, but it could at least indicate which
process finished or couldn't be found and thus had already finished.
Then you could use POSIX "wait <pid>" to get the exit status and be
guaranteed that it wouldn't block (a simple timeout option to wait
might be useful here for cases where bash's child process may not be
visible to an external command).  I'm not aware of anything like this
existing, but it would be a nice way to separate this functionality
from the shell, reduce the number of options in wait, and support
other shells.

Again, thanks for your patience Chet,
Steve



Re: wait -n misses signaled subprocess

2024-01-30 Thread Chet Ramey

On 1/28/24 10:26 PM, Dale R. Worley wrote:

Chet Ramey  writes:

echo "wait -n $pid return code $? @${SECONDS} (BUG)"


The job isn't in the jobs table because you've already been notified about
it and it's not `new', you get the unknown job error status.


The man page gives a lot of details and I'm trying to digest them into a
structure.

It looks like the underlying meaning of "-n" is to only pay attention to
*new* job completions, and anything "in the past" (already notified and
moved to the table of terminated background jobs) is ignored.


That was the original implementation, yes. The idea was to add something
to augment the `wait for all' strategy of wait without pid arguments.


The underlying meaning of providing one or more ids is that "wait" is to
only be concerned with those jobs.


Right, that's the current operation.



The man page doesn't make clear that if you don't specify "-n" and do
supply ids and one of them has already terminated, you'll get its status
(from the terminated table); the wording suggests that "wait" will
always *wait for* a termination.


Only if your mental model of the operation links the wait builtin and wait
system call. If the pid has already terminated, the wait is immediate, and
there's no reason to call the system call, but wait still returns the
status.

(As an aside, that is one long paragraph describing `wait'. I need to
break that up.)



There's also an interaction in that "wait" will only look at the
terminated table if "-n" is not specified *and* ids are specified.


This is to maintain POSIX semantics, with extensions. This is one of the
issues -- should `wait -n' with arguments look for terminated processes
in that table, the way `wait' without options does?

Chet



Re: wait -n misses signaled subprocess

2024-01-29 Thread Robert Elz
Date:        Mon, 29 Jan 2024 12:07:53 -0500
From:        Chet Ramey
Message-ID:  

  | What does `wait -n' without job arguments mean?

Find, or if there are none already, wait*(2) for, a process (job technically)
that has changed state (terminated in POSIX, and one day in the NetBSD
shell, that difference isn't relevant here) and return its status.
If there's already a terminated job (job which has changed status in bash)
then no wait type sys call gets performed (that already happened).

It also returns the status of that process, rather than simple "0" which
a bare "wait" does (and with the appropriate arg, tells you which process
it was).

  | OK. Since wait without options can already wait for the same pid multiple
  | times, the -n option has to bring some new functionality here.

Yes, without args, it waits until all listed arg processes (jobs) are
finished (or changed state) and returns the status of the last.   With -n
it waits for any one of them, just as the bash man page says it will.
The "any one" (vs "all") is the new functionality.
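
The three return-status rules described above can be seen side by side in a short script. This is a sketch for bash; the variable names are illustrative, and the fast job is deliberately still running when `wait -n` is called so that the disputed "already terminated" case does not arise.

```bash
#!/usr/bin/env bash
# "any one": wait -n returns when the first job finishes, with its status
{ sleep 1; exit 5; } &
{ sleep 2; exit 6; } &
slow=$!
any_status=0
wait -n || any_status=$?
slow_status=0
wait "$slow" || slow_status=$?      # collect the remaining job

# "all listed": waits for every operand, status is that of the last one
{ exit 3; } &
a=$!
{ exit 4; } &
b=$!
all_status=0
wait "$a" "$b" || all_status=$?

# bare wait: waits for everything and reports 0 regardless
{ exit 3; } &
bare_status=0
wait || bare_status=$?

echo "wait -n: $any_status, wait pid pid: $all_status, bare wait: $bare_status"
```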

  | As long as it's still in the jobs list.

Yes, of course - the final para of my message covered that case.

  | OK. We can agree there shouldn't be any difference between `wait pid'
  | and `wait -n pid'.

Yes, but just because that's a degenerate case of the more general commands,
which happens in each case to devolve into the same thing.

And from a different message:

chet.ra...@case.edu said:
  | So should the shell require the user to periodically run `wait' in a non-
  | interactive shell without job control to clean dead jobs out of the jobs
  | list? I don't think so. 

I do.   wait or jobs ("jobs >/dev/null" is a nice simple clean up, without
the potential hang waiting for things to terminate that the wait utility
imposes).   A new option to wait(1) (either a simple one, perhaps -t, to
only wait for already terminated jobs, or a timeout, where 0 indicates never
to wait at all (ie: don't do the wait sys call) which would be a more
general, but more costly, mechanism).   But as long as it is just a matter
of cleaning up, and jobs works for that, I don't currently see the need.

Of course, you're also allowed to dump processes from the lists if there
get to be too many of them, but on modern systems, it really should be
possible to retain hundreds, if not thousands, without any real problem.

And of course, you're not required to retain status of any job if there's
no way that the script can request it - but determining that these days is
difficult.  It used to be easy in the Sys V/POSIX model where if $! wasn't
saved, then there was no way for the script to request the status, as it
couldn't (reasonably - parsing job trees from ps output doesn't count) find
out the pid to wait for (and simple "wait" never returns any status).

These days, with the jobs command available, a script could do
pids=$(jobs -l | code to parse the output and print the pids)
and determine what it can wait for that way (the code isn't difficult)
- and it can also wait on %1 %2 ... without having any idea what the pids
might be, so in practice adding the (non-trivial) code to monitor references
to $! isn't worth the bother (IMO).
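
In bash (and in the POSIX `jobs` utility) the parsing step can usually be avoided, because `jobs -p` prints only the process IDs. A minimal sketch, assuming the common idiom of running `jobs -p` in a command substitution; `job_pids` and `count` are illustrative names, and the two `sleep` jobs stand in for real work.

```bash
#!/usr/bin/env bash
sleep 30 &
sleep 30 &

# One pid per line, without $! ever having been saved
job_pids=$(jobs -p)

count=0
for pid in $job_pids; do
    count=$((count + 1))
    kill "$pid"          # here the pids are only used to stop the jobs
done
wait 2>/dev/null         # reap them; bare wait reports status 0
echo "found and stopped $count background job(s)"
```

This illustrates the point being made: once the script can enumerate its jobs this way, the shell can no longer tell from uses of `$!` alone whether a status might still be requested.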

It's also a bit unusual for non-interactive code to run lots of async jobs
without waiting for results - doing that is a sure way to run into the
"max user processes" limit, and have things start failing.   If there are
less than that, then having the shell retain the info until the script
terminates isn't really a very big cost, should the script not bother to
ever clean up.

  | I think it's whether or not `wait -n pid' behaves the same as `wait pid' and
  | looks in the list of saved exit statuses if the pid isn't found in a job in
  | the jobs list. 

We have it simpler than that, there's just one list, which serves both
purposes.  Makes things easier I believe (in all three of: shell code, shell
doc, and user understanding), even if it does consume a few more bytes for
a little longer than is really needed (jobs needs the command strings, so
they can be printed, wait doesn't, so retaining that is an extra cost ... not
one large enough for anyone to have ever noticed though).

kre




Re: wait -n misses signaled subprocess

2024-01-29 Thread Chet Ramey

On 1/29/24 7:54 AM, Andreas Schwab wrote:

On Jan 29 2024, Robert Elz wrote:


I always wondered why the option was 'n'


n = next?


Yes: the original implementation polled the non-terminated background jobs
and returned when one of them exited.



Re: wait -n misses signaled subprocess

2024-01-29 Thread Chet Ramey

On 1/29/24 12:33 PM, Chet Ramey wrote:


You should have. You told me about your implementation using `-n' in
10/2017, long before I implemented it (4/2020).


Sorry, this is my mistake. That was a different feature. Bash implemented
`wait -n' first.


For those wondering, the `different feature' was having `wait -n' pay
attention to its pid/job arguments.



Re: wait -n misses signaled subprocess

2024-01-29 Thread Chet Ramey

On 1/29/24 12:07 PM, Chet Ramey wrote:

On 1/29/24 7:12 AM, Robert Elz wrote:

 Date:    Sun, 28 Jan 2024 18:21:42 -0500
 From:    Chet Ramey 
 Message-ID:  <3347f790-529b-4bee-91fd-de39bed3f...@case.edu>

   | because `wait -n' doesn't look in the table
   | of saved statuses -- its job is to wait for `new' jobs to terminate, not
   | ones that have already been removed from the table.

That's very interesting, and most unexpected information.

I always wondered why the option was 'n' - I would have made it
be 'a' probably, as a shorthand for "any" - but then I decided
that perhaps 'n' was a better choice, as "a" could also be "all",
the option name would not be providing any real clue at all, so
I assumed you'd been ultra clever and used 'n' as the next char
in "any" and also as it can be read like the first part of "en" "ee"
(which you need to say out loud, or at least in your head, to get the
effect of).


You should have. You told me about your implementation using `-n' in
10/2017, long before I implemented it (4/2020).


Sorry, this is my mistake. That was a different feature. Bash implemented
`wait -n' first.



Re: wait -n misses signaled subprocess

2024-01-29 Thread Chet Ramey

On 1/29/24 7:12 AM, Robert Elz wrote:

 Date:        Sun, 28 Jan 2024 18:21:42 -0500
 From:        Chet Ramey
 Message-ID:  <3347f790-529b-4bee-91fd-de39bed3f...@case.edu>

   | because `wait -n' doesn't look in the table
   | of saved statuses -- its job is to wait for `new' jobs to terminate, not
   | ones that have already been removed from the table.

That's very interesting, and most unexpected information.

I always wondered why the option was 'n' - I would have made it
be 'a' probably, as a shorthand for "any" - but then I decided
that perhaps 'n' was a better choice, as "a" could also be "all",
the option name would not be providing any real clue at all, so
I assumed you'd been ultra clever and used 'n' as the next char
in "any" and also as it can be read like the first part of "en" "ee"
(which you need to say out loud, or at least in your head, to get the
effect of).


You should have. You told me about your implementation using `-n' in
10/2017, long before I implemented it (4/2020).


It never even dawned on me that 'n' might mean "new", as in only
processes that hadn't terminated at the time the wait -n was done,
as that's essentially a recipe for script madness, race conditions
galore, as the one reported here.


What does `wait -n' without job arguments mean?


What wait(1) needed was an alternative to its normal "all" semantic,
just "wait" waits for every background job to terminate, what's needed
is a way to wait for any one of them (whether already terminated, but
not previously waited for or not).   That's what I always assumed
wait -n was doing, and how I implemented it in the NetBSD shell.


OK. Since wait without options can already wait for the same pid multiple
times, the -n option has to bring some new functionality here.




Similarly "wait pid1 pid2 pid3" waits for all 3 of those to
terminate, so "wait -n pid1 pid2 pid3" should wait for any one
of them - already terminated or not. 


As long as it's still in the jobs list.



 When there's just one pid
in the list, the -n option always seemed useless to me, there
ought be no difference between "wait pid" and "wait -n pid"
(as in wait for all of one, and wait for any of one, mean the
same thing, wait for that one), but obviously should still be
supported for consistency. 


OK. We can agree there shouldn't be any difference between `wait pid'
and `wait -n pid'.



Re: wait -n misses signaled subprocess

2024-01-29 Thread Chet Ramey

On 1/28/24 7:19 PM, Steven Pelley wrote:

Thank you Chet for your thorough reply.

You make a few comments about differences in output (stderr for not
finding a job, notifications for jobs terminating) and in all cases I
believe you are correct.  Let's assume job control is disabled.


OK, but remember:

"When job control isn't enabled (usually in a non-interactive shell), the
shell doesn't notify users about terminated background jobs, but it still
removes dead jobs from the jobs list before reading the next command. It
cleans the jobs table of notified jobs at other times, too, to move dead
jobs out of the jobs list and keep it a manageable size."

These exit statuses are still available to `wait pid' (but not `wait -n
pid') as POSIX specifies.
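
The part of this that is not in dispute can be checked with a short script: a job killed by a signal, and already cleaned out of the jobs list, still has its status available to plain `wait pid`. This is a sketch assuming a non-interactive bash; `saved_status` is an illustrative name, and the shell may print a "Terminated" notice on stderr while it runs.

```bash
#!/usr/bin/env bash
sleep 30 &
pid=$!
kill -TERM "$pid"
sleep 1                  # give the shell time to reap the job and clean its jobs list

saved_status=0
wait "$pid" 2>/dev/null || saved_status=$?
echo "wait $pid returned $saved_status"    # 128 + 15 (SIGTERM) = 143
```

Replacing `wait "$pid"` with `wait -n "$pid"` is the case the original report is about; per this thread it returned 127 at the time, so no expected value is given for it here.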





I expect the line ending (BUG) to indicate a return code of 143.


It might, if `wait -n' looked for already-notified jobs in the table of
saved exit statuses, but it doesn't. Should it, even if the user has
already been notified of the status of that job?


When job control is disabled I get this output for the same test (just
for consistent reference):


The results are consistent with what I described previously.



There's no user notification of the job terminating because job
control is disabled.  The "wait -n" returning 127 is the first
opportunity the shell might have to notify the user of the job. 


So should the shell require the user to periodically run `wait' in a non-
interactive shell without job control to clean dead jobs out of the jobs
list? I don't think so.


In
this context I think that "even if the user has already been notified
of the status of that job" doesn't apply -- the user hasn't been
notified of the job terminating. 


See above.


Even so, this behavior differs from a similar example but where the
first job ends successfully, or at least without being killed by a
signal.  It still terminates prior to calling "wait -n" (this is from
Jan 24 but I'll post again to keep everything in a linear thread).
echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"
sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"

output (no job control):
TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2779 @0
child finishing @1
wait -n 2779 return code 1 @2

It does look in the table of saved exit statuses, returning 1.


It doesn't. In this case, the code path it follows marks the job as dead
but doesn't mark it as notified (since it exited normally), so it's still
in the jobs list when `wait -n' is called, and available for returning.
That's probably a bug there.



I think the sticking point is the notion of the user being notified of
the status of a job. 


I think it's whether or not `wait -n pid' behaves the same as `wait pid'
and looks in the list of saved exit statuses if the pid isn't found in a
job in the jobs list.

Chet



Re: wait -n misses signaled subprocess

2024-01-29 Thread Greg Wooledge
On Mon, Jan 29, 2024 at 08:52:37PM +0700, Robert Elz wrote:
> Date:        Mon, 29 Jan 2024 13:54:10 +0100
> From:        Andreas Schwab
> Message-ID:  
> 
>   | n = next?

This was my assumption as well.

> That would be a reasonable interpretation, I guess, but
> unfortunately not one which helps the current question,
> as it doesn't answer "next what?"

For the record, with bash 5.2:


unicorn:~$ cat foo
#!/bin/bash

sleep 1 &
sleep 37 &
sleep 2
time wait -n
unicorn:~$ ./foo
real 0.001  user 0.000  sys 0.001
unicorn:~$ ps
    PID TTY          TIME CMD
   1152 pts/3    00:00:00 bash
 542197 pts/3    00:00:00 sleep
 542200 pts/3    00:00:00 ps
unicorn:~$ ps -fp 542197
UID          PID    PPID  C STIME TTY          TIME CMD
greg      542197       1  0 08:59 pts/3    00:00:00 sleep 37


wait -n *does* appear to acknowledge the already-terminated child process,
despite a second child process still being active.



Re: wait -n misses signaled subprocess

2024-01-29 Thread Robert Elz
Date:        Mon, 29 Jan 2024 13:54:10 +0100
From:        Andreas Schwab
Message-ID:  

  | n = next?

That would be a reasonable interpretation, I guess, but
unfortunately not one which helps the current question,
as it doesn't answer "next what?"   It could be "the next
of these processes which terminates" (like the "new"
interpretation) or "the next of these processes that has
a status available" (like the "any" interpretation).

While I'm here, I will also mention that the bash man page
section for wait(1) does say "any" in one place, and equivalent
(but better) wording in another ("a single job"), but never
mentions "new" anywhere.

Further in both the -n and no -n cases, the wait utility is
stated to "wait for" (whatever is appropriate for the args given)
hence the operation should be assumed to be the same in both
cases, either an actual pause is required in both (until some
appropriate process changes status) or is not required in either
(if such a process has already terminated and is waiting for
shell level reaping).

Note that processes that have already been reported (via wait,
or jobs, or the prompt level jobs lookalike) have already been
reported, so if any of that had happened wait isn't expected to
be able to fetch status from them again.

kre



Re: wait -n misses signaled subprocess

2024-01-29 Thread Andreas Schwab
On Jan 29 2024, Robert Elz wrote:

> I always wondered why the option was 'n'

n = next?

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."



Re: wait -n misses signaled subprocess

2024-01-29 Thread Robert Elz
Date:        Sun, 28 Jan 2024 18:21:42 -0500
From:        Chet Ramey
Message-ID:  <3347f790-529b-4bee-91fd-de39bed3f...@case.edu>

  | because `wait -n' doesn't look in the table
  | of saved statuses -- its job is to wait for `new' jobs to terminate, not
  | ones that have already been removed from the table.

That's very interesting, and most unexpected information.

I always wondered why the option was 'n' - I would have made it
be 'a' probably, as a shorthand for "any" - but then I decided
that perhaps 'n' was a better choice, as "a" could also be "all",
the option name would not be providing any real clue at all, so
I assumed you'd been ultra clever and used 'n' as the next char
in "any" and also as it can be read like the first part of "en" "ee"
(which you need to say out loud, or at least in your head, to get the
effect of).

It never even dawned on me that 'n' might mean "new", as in only
processes that hadn't terminated at the time the wait -n was done,
as that's essentially a recipe for script madness, race conditions
galore, as the one reported here.

What wait(1) needed was an alternative to its normal "all" semantic,
just "wait" waits for every background job to terminate, what's needed
is a way to wait for any one of them (whether already terminated, but
not previously waited for or not).   That's what I always assumed
wait -n was doing, and how I implemented it in the NetBSD shell.

Similarly "wait pid1 pid2 pid3" waits for all 3 of those to
terminate, so "wait -n pid1 pid2 pid3" should wait for any one
of them - already terminated or not.   When there's just one pid
in the list, the -n option always seemed useless to me, there
ought be no difference between "wait pid" and "wait -n pid"
(as in wait for all of one, and wait for any of one, mean the
same thing, wait for that one), but obviously should still be
supported for consistency.   To think that it might be interpreted
as "wait for a new process "pid" to terminate, ignoring the one that
just finished a few milliseconds ago" is simply astounding, completely
unbelievable.

And from what I have seen of the other comments, several from
long term & dedicated bash users, it is just as astounding to
them as well.   Please treat this as a bug, and fix it.  Quickly.

kre



Re: wait -n misses signaled subprocess

2024-01-28 Thread Oğuz
On Monday, January 29, 2024, Greg Wooledge  wrote:
>
> Anyway... a script writer who has a basic familiarity with wait(2) and
> who reads about "wait -n" will probably assume that wait -n will return
> immediately if a child process has already terminated and hasn't been
> "pseudo-reaped" by a previous "wait" command yet.  If three children
> have terminated, then the next three "wait -n" commands should return
> immediately, and the fourth should block (assuming a fourth child exists).
>

This is the case with me. There is no point in having `wait -n' if it can't
distinguish a single job terminating from multiple jobs terminating between
subsequent calls.


-- 
Oğuz


Re: wait -n misses signaled subprocess

2024-01-28 Thread Greg Wooledge
On Sun, Jan 28, 2024 at 10:26:27PM -0500, Dale R. Worley wrote:
> The man page doesn't make clear that if you don't specify "-n" and do
> supply ids and one of them has already terminated, you'll get its status
> (from the terminated table); the wording suggests that "wait" will
> always *wait for* a termination.

This might be a result of C programmers who already know the semantics
of wait(2) writing documentation which assumes the reader *also* knows
these semantics.

wait(2) and its brethren return immediately if the process in question
has already terminated.  It's how you reap the zombie and free up the
process table slot, while also retrieving its exit status.  If it's
not already dead, then wait(2) blocks until death occurs.

The shell's "wait" command is meant to mimic this behavior, at its core.
There are some differences, however -- notably, the shell aggressively
reaps zombies and stores their exit statuses in memory, revealing them
to you in the event that you call "wait".  Normally this change is
invisible, but if you were *counting* on the zombie to be there, holding
on to that PID, preventing it from being reused until you could observe
the death and react to it, then you're screwed.  Don't use the shell for
this.

Anyway... a script writer who has a basic familiarity with wait(2) and
who reads about "wait -n" will probably assume that wait -n will return
immediately if a child process has already terminated and hasn't been
"pseudo-reaped" by a previous "wait" command yet.  If three children
have terminated, then the next three "wait -n" commands should return
immediately, and the fourth should block (assuming a fourth child exists).



Re: wait -n misses signaled subprocess

2024-01-28 Thread Dale R. Worley
Chet Ramey  writes:
>> echo "wait -n $pid return code $? @${SECONDS} (BUG)"
>
> The job isn't in the jobs table because you've already been notified about
> it and it's not `new', you get the unknown job error status.

The man page gives a lot of details and I'm trying to digest them into a
structure.

It looks like the underlying meaning of "-n" is to only pay attention to
*new* job completions, and anything "in the past" (already notified and
moved to the table of terminated background jobs) is ignored.

The underlying meaning of providing one or more ids is that "wait" is to
only be concerned with those jobs.

The man page doesn't make clear that if you don't specify "-n" and do
supply ids and one of them has already terminated, you'll get its status
(from the terminated table); the wording suggests that "wait" will
always *wait for* a termination.

There's also an interaction in that "wait" will only look at the
terminated table if "-n" is not specified *and* ids are specified.

Am I understanding this correctly?

Dale



Re: wait -n misses signaled subprocess

2024-01-28 Thread Steven Pelley
Thank you Chet for your thorough reply.

You make a few comments about differences in output (stderr for not
finding a job, notifications for jobs terminating) and in all cases I
believe you are correct.  Let's assume job control is disabled.

> >
> > I expect the line ending (BUG) to indicate a return code of 143.
>
> It might, if `wait -n' looked for already-notified jobs in the table of
> saved exit statuses, but it doesn't. Should it, even if the user has
> already been notified of the status of that job?

When job control is disabled I get this output for the same test (just
for consistent reference):
TEST: KILL PRIOR TO wait -n @0
kill -TERM 526 @0
./test.sh: line 13: wait: 526: no such job
wait -n 526 return code 127 @2 (BUG)
wait 526 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 544 @3
wait -n 544 return code 143 @3
wait 544 return code 143 @3

There's no user notification of the job terminating because job
control is disabled.  The "wait -n" returning 127 is the first
opportunity the shell might have to notify the user of the job.  In
this context I think that "even if the user has already been notified
of the status of that job" doesn't apply -- the user hasn't been
notified of the job terminating.  It's possible you are saying that
the user was notified of the job's termination in some other way that
I missed, so please tell me if I'm misunderstanding this part.

Even so, this behavior differs from a similar example but where the
first job ends successfully, or at least without being killed by a
signal.  It still terminates prior to calling "wait -n" (this is from
Jan 24 but I'll post again to keep everything in a linear thread).
echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"
sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"

output (no job control):
TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2779 @0
child finishing @1
wait -n 2779 return code 1 @2

It does look in the table of saved exit statuses, returning 1.

I think the sticking point is the notion of the user being notified of
the status of a job.  In these examples I don't see that the user is
notified prior to the first call to "wait -n," and so I think that
this call should notify the user.  This first call to "wait -n" _does_
notify the user in the case that the job terminated by exiting (not
signalled), but _does not_ notify the user in the case that the job
was killed.

Steve



Re: wait -n misses signaled subprocess

2024-01-28 Thread Chet Ramey

On 1/22/24 11:30 AM, Steven Pelley wrote:


I've tried:
killing with SIGTERM and SIGALRM
killing from the test script, a subshell, and another terminal.  I
don't believe this is related to kill being a builtin.
enabling job control (set -m)
bash versions 4.4.12, 5.2.15, 5.2.21.  All linux arm64


You must have left `set -m' enabled in the version whose results you
posted, since you don't get non-interactive status notifications unless
you do.

Let's see if we can go through what happens. Part of it has to do with
notifications and when the shell removes jobs from the jobs table.

When the shell is interactive, and job control is enabled, it checks for
terminated background jobs, notifies the user about their status if
appropriate, and removes them from the jobs list -- bash removes a job
from the list when it's notified the user of its status -- when it goes
to read a new command, before printing the prompt. In a non-interactive
shell, it obviously doesn't print a prompt, but it does the same thing,
even the notification, before reading the next command.

When job control isn't enabled (usually in a non-interactive shell), the
shell doesn't notify users about terminated background jobs, but it still
removes dead jobs from the jobs list before reading the next command. It
cleans the jobs table of notified jobs at other times, too, to move dead
jobs out of the jobs list and keep it a manageable size.

The shell does keep a table of terminated background jobs that have been
removed from the jobs list, because POSIX says you have to keep track of
the last CHILD_MAX pids and make their exit statuses available to `wait'
(but see below).
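
As a small illustration of the saved-status table, the sketch below lets three children terminate (two by exiting, one by SIGTERM) before asking for their statuses with plain `wait pid'. The array names and exit codes are illustrative only.

```bash
#!/usr/bin/env bash
# Plain `wait PID` still reports the status of a child that terminated
# earlier, because the shell remembers it after removing the job.
declare -a pids=() statuses=()

( exit 3 ) &                 # exits normally with status 3
pids+=("$!")
( exit 4 ) &                 # exits normally with status 4
pids+=("$!")
sleep 30 &                   # will be terminated by a signal
pids+=("$!")
kill -TERM "${pids[2]}"

sleep 1                      # all three children are dead by now

for pid in "${pids[@]}"; do
    rc=0
    wait "$pid" 2>/dev/null || rc=$?
    statuses+=("$rc")
    echo "wait $pid -> $rc"
done
```

The loop never uses -n, so it does not depend on whether a job is still in the jobs list or has already been moved out of it.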


Test script:
# change to test other signals
sig=TERM

echo "TEST: KILL PRIOR TO wait -n @${SECONDS}"
{ sleep 1; exit 1; } &
pid=$!


This ends up adding this to the jobs table as job 1. $pid is the pgrp
leader.


echo "kill -$sig $pid @${SECONDS}"
kill -$sig $pid


You kill that job, it terminates, the shell gets the SIGCHLD and waits
for it, marks it as dead in the jobs table, and goes to read the next
command. It doesn't matter whether this happens before the sleep or the
wait; the job gets removed as soon as the user is notified and moved to
the table of saved statuses. (If the shell isn't doing notifications,
the job just gets moved.)



sleep 2
wait -n $pid


When I run this, whether job control is enabled or not, I get an error
message about an unknown job, because `wait -n' doesn't look in the table
of saved statuses -- its job is to wait for `new' jobs to terminate, not
ones that have already been removed from the table. Maybe you're
redirecting stderr.


echo "wait -n $pid return code $? @${SECONDS} (BUG)"


The job isn't in the jobs table because you've already been notified about
it and it's not `new', you get the unknown job error status.


wait $pid
echo "wait $pid return code $? @${SECONDS}"


This works, because wait without -n looks in the table of saved statuses.



echo "TEST: KILL DURING wait -n @${SECONDS}"
{ sleep 2; exit 1; } &
pid=$!
{ sleep 1; echo "kill -$sig $pid @${SECONDS}"; kill -$sig $pid; } &

wait -n $pid


The shell doesn't get the SIGCHLD before running wait, so the job is still
in the jobs list.


echo "wait -n $pid return code $? @${SECONDS}"
wait $pid
echo "wait $pid return code $? @${SECONDS}"


And you get the same status here. Even though the `wait -n' removes the
job from the jobs list, the subsequent `wait' can still find it in the
table of saved exit statuses.




For which I get the following example output:
TEST: KILL PRIOR TO wait -n @0
kill -TERM 1384 @0
./test.sh: line 14:  1384 Terminated  { sleep 1; exit 1; }
wait -n 1384 return code 127 @2 (BUG)
wait 1384 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 1402 @3
./test.sh: line 25:  1402 Terminated  { sleep 2; exit 1; }
wait -n 1402 return code 143 @3
wait 1402 return code 143 @3

I expect the line ending (BUG) to indicate a return code of 143.


It might, if `wait -n' looked for already-notified jobs in the table of
saved exit statuses, but it doesn't. Should it, even if the user has
already been notified of the status of that job?

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/





Re: wait -n misses signaled subprocess

2024-01-24 Thread Oğuz
On Mon, Jan 22, 2024 at 8:13 PM Steven Pelley  wrote:
>
> Hello,
> I've encountered what I believe is a bug in bash's "wait -n".  wait -n
> fails to return for processes that terminate due to a signal prior to
> calling wait -n.  Instead, it returns 127 with an error that the
> process id cannot be found.  Calling wait <pid> (without -n) then
> returns its exit code (e.g., 143).  I expect wait -n to return each
> process through successive calls to wait -n, which is the case for
> processes that terminate in other manners even prior to calling wait
> -n.

I agree that this is a bug in bash.
jobs.c/wait_for_any_job() marks all dead jobs as notified after
reporting the status of the first one and misses the rest. With the
following change (not a real fix, just for demonstration), devel
branch behaves as expected:

diff --git a/jobs.c b/jobs.c
index 3e68bf24..d7c8d11b 100644
--- a/jobs.c
+++ b/jobs.c
@@ -3257,7 +3257,7 @@ wait_for_any_job (int flags, struct procstat *ps)
 {
   if ((flags & JWAIT_WAITING) && jobs[i] && IS_WAITING (i) == 0)
     continue;   /* if we don't want it, skip it */
-  if (jobs[i] && DEADJOB (i) && IS_NOTIFIED (i) == 0 && IS_FOREGROUND (i) == 0)
+  if (jobs[i] && DEADJOB (i) && IS_FOREGROUND (i) == 0)
{
 return_job:
  r = job_exit_status (i);



Re: wait -n misses signaled subprocess

2024-01-24 Thread Steven Pelley
Apologies for a quick double post, strace is fairly straightforward
and confirms that bash is properly reaping the killed processes.  This
isn't a matter of the wait syscall failing to return the signaled
child process.

Running the test from my original post and producing:
TEST: KILL PRIOR TO wait -n @0
kill -TERM 6941 @0
./test.sh: line 13: wait: 6941: no such job
wait -n 6941 return code 127 @2 (BUG)
wait 6941 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 6970 @3
wait -n 6970 return code 143 @3
wait 6970 return code 143 @3

shows:
kill(6941, SIGTERM) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6941,
si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 6941
wait4(-1, 0xc62b6d50, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

and

wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 6970
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0},
{sa_handler=0xd98a21a4, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6970,
si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 6972
wait4(-1, 0xc62b6860, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

Signaling prior to wait -n (pid 6941) is awaited (wait4) in the
SIGCHLD signal handler and determines that it was signaled and
terminated due to SIGTERM.
Signaling during wait -n (pid 6970) is awaited prior to the SIGCHLD
signal indicating it was killed by a blocking call to wait4, also
returning that it was signaled and terminated due to SIGTERM.
The only difference I see here is whether the subprocess is awaited by
the blocking call rather than the nonblocking call inside the SIGCHLD
handler.  For what it's worth I see subprocesses that terminate
without signal also showing up in wait4 calls outside the SIGCHLD
handler but this could easily be a matter of chance timing and a red
herring.

Steve

On Wed, Jan 24, 2024 at 12:40 PM Steven Pelley  wrote:
>
> > In the first case, if the subprocess N has terminated, its report is
> > still queued and "wait" retrieves it.  In the second case, if the
> > subprocess N has terminated, it doesn't exist and as the manual page
> > says "If id specifies a non-existent process or job, the return status
> > is 127."
> >
> > What you're pointing out is that that creates a race condition when the
> > subprocess ends before the "wait".  And it seems that the kernel has
> > enough information to tell "wait -n N", "process N doesn't exist, but
> > you do have a queued termination report for it".  But it's not clear
> > that there's a way to ask the kernel for that information without
> > reading all the queued termination reports (and losing the ability to
> > return them for other "wait" calls).
>
> Thanks for the response, but I don't believe this is correct.
>
> Your understanding of the wait syscall is correct except that the exit
> code and process information always remains available until the
> process is awaited by its parent -- it is the wait syscall that itself
> reaps the process and makes it unavailable to later searches by pid.
> There is a possibility that the parent (bash in this case) might reap
> the process in multiple ways (e.g., from different threads, setting
> the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the
> SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that
> race with each other, but the parent is always given an opportunity to
> read the exit code and reap the process if not disabled with SIGCHLD
> handler configuration.
>
> My understanding of bash is that it internally maintains a queue/list
> of finished child jobs to return such that wait -n mimics aspects of
> the wait syscall.  The discussion at
> https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
> supports that bash "silently" reaps child processes and decouples the
> wait syscall from the wait command.
>
> I assume it's possible to confirm that bash is awaiting the process
> and retrieving the exit code via ptrace/strace but I'm unfamiliar with
> these tools or bash logs.
>
> The test below allows the subprocess to complete normally, without
> being signaled, and then successfully retrieves its exit code via wait
> -n.  This subprocess terminates before the call to wait -n.  I see no
> documented reason that a process terminating without signal prior to
> wait -n should be returned while a process terminating with signal
> prior to wait -n should not.
>
> echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
> { sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
> pid=$!
> echo "child proc $pid @${SECONDS}"
>
> sleep 2
> wait -n $pid
> echo "wait -n $pid return code $? @${SECONDS}"
>
>
> For which I get output:
> TEST: 

Re: wait -n misses signaled subprocess

2024-01-24 Thread Steven Pelley
> In the first case, if the subprocess N has terminated, its report is
> still queued and "wait" retrieves it.  In the second case, if the
> subprocess N has terminated, it doesn't exist and as the manual page
> says "If id specifies a non-existent process or job, the return status
> is 127."
>
> What you're pointing out is that that creates a race condition when the
> subprocess ends before the "wait".  And it seems that the kernel has
> enough information to tell "wait -n N", "process N doesn't exist, but
> you do have a queued termination report for it".  But it's not clear
> that there's a way to ask the kernel for that information without
> reading all the queued termination reports (and losing the ability to
> return them for other "wait" calls).

Thanks for the response, but I don't believe this is correct.

Your understanding of the wait syscall is correct except that the exit
code and process information always remains available until the
process is awaited by its parent -- it is the wait syscall that itself
reaps the process and makes it unavailable to later searches by pid.
There is a possibility that the parent (bash in this case) might reap
the process in multiple ways (e.g., from different threads, setting
the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the
SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that
race with each other, but the parent is always given an opportunity to
read the exit code and reap the process if not disabled with SIGCHLD
handler configuration.

My understanding of bash is that it internally maintains a queue/list
of finished child jobs to return such that wait -n mimics aspects of
the wait syscall.  The discussion at
https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
supports that bash "silently" reaps child processes and decouples the
wait syscall from the wait command.

I assume it's possible to confirm that bash is awaiting the process
and retrieving the exit code via ptrace/strace but I'm unfamiliar with
these tools or bash logs.

The test below allows the subprocess to complete normally, without
being signaled, and then successfully retrieves its exit code via wait
-n.  This subprocess terminates before the call to wait -n.  I see no
documented reason that a process terminating without signal prior to
wait -n should be returned while a process terminating with signal
prior to wait -n should not.

echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"

sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"


For which I get output:
TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2270 @0
child finishing @1
wait -n 2270 return code 1 @2


Steve



Re: wait -n misses signaled subprocess

2024-01-24 Thread Dale R. Worley
Steven Pelley  writes:
> wait -n
> fails to return for processes that terminate due to a signal prior to
> calling wait -n.  Instead, it returns 127 with an error that the
> process id cannot be found.  Calling wait <pid> (without -n) then
> returns its exit code (e.g., 143).

My understanding is that this is how "wait" is expected to work, or at
least known to work, but mostly because that's how the *kernel* works.

"wait" without -n makes a system call which means "give me information
about a terminated subprocess".  The termination (or perhaps
change-of-state) reports from subprocesses are queued up in the kernel
until the process retrieves them through "wait" system calls.

OTOH, "wait" with -n makes a system call which means "give me
information about my subprocess N".

In the first case, if the subprocess N has terminated, its report is
still queued and "wait" retrieves it.  In the second case, if the
subprocess N has terminated, it doesn't exist and as the manual page
says "If id specifies a non-existent process or job, the return status
is 127."

What you're pointing out is that that creates a race condition when the
subprocess ends before the "wait".  And it seems that the kernel has
enough information to tell "wait -n N", "process N doesn't exist, but
you do have a queued termination report for it".  But it's not clear
that there's a way to ask the kernel for that information without
reading all the queued termination reports (and losing the ability to
return them for other "wait" calls).

Then again, I might be wrong.

Dale



wait -n misses signaled subprocess

2024-01-22 Thread Steven Pelley
Hello,
I've encountered what I believe is a bug in bash's "wait -n".  wait -n
fails to return for processes that terminate due to a signal prior to
calling wait -n.  Instead, it returns 127 with an error that the
process id cannot be found.  Calling wait <pid> (without -n) then
returns its exit code (e.g., 143).  I expect wait -n to return each
process through successive calls to wait -n, which is the case for
processes that terminate in other manners even prior to calling wait
-n.  Killing a process while the wait -n is actively blocking works
correctly.  Test script at bottom.

The specific situation where I encountered this is trying to coordinate
my own cooperative exit and handling/propagating SIGTERM.  If I
propagate this SIGTERM by killing multiple processes at once (kill
pid1 pid2 pid3 ...) the next call to wait -n will return 143 and
indicate a pid (via -p) but the next call to wait -n returns 127 as
all processes previously terminated.  If any of the awaited processes
haven't yet terminated then you only discover the previously-killed
process whenever the next terminates.  I have workarounds/I'm not
blocked but this seems a reasonable use case and worth sharing.
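
One possible shape for such a workaround, sketched under the assumption that the script has saved every child pid: signal all of them at once, then collect each status with plain `wait pid' rather than with successive calls to wait -n. The function name terminate_and_collect and the array child_status are hypothetical names chosen for this sketch.

```bash
#!/usr/bin/env bash
# Results are stored here, keyed by pid (requires bash 4+ for -A).
declare -A child_status=()

# terminate_and_collect SIGNAL PID...  (hypothetical helper)
# Sends SIGNAL to every PID in one kill call, then waits for each PID by id.
terminate_and_collect() {
    local sig=$1 pid rc
    shift
    kill "-$sig" "$@" 2>/dev/null || true
    for pid in "$@"; do
        rc=0
        wait "$pid" 2>/dev/null || rc=$?
        child_status[$pid]=$rc
    done
}
```

For example, after `sleep 30 & p1=$!' and `sleep 30 & p2=$!', calling `terminate_and_collect TERM "$p1" "$p2"' should leave 143 in both `${child_status[$p1]}' and `${child_status[$p2]}'. The trade-off is that statuses arrive in argument order, not in termination order.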

I've tried:
killing with SIGTERM and SIGALRM
killing from the test script, a subshell, and another terminal.  I
don't believe this is related to kill being a builtin.
enabling job control (set -m)
bash versions 4.4.12, 5.2.15, 5.2.21.  All linux arm64

Test script:
# change to test other signals
sig=TERM

echo "TEST: KILL PRIOR TO wait -n @${SECONDS}"
{ sleep 1; exit 1; } &
pid=$!
echo "kill -$sig $pid @${SECONDS}"
kill -$sig $pid

sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS} (BUG)"
wait $pid
echo "wait $pid return code $? @${SECONDS}"

echo "TEST: KILL DURING wait -n @${SECONDS}"
{ sleep 2; exit 1; } &
pid=$!
{ sleep 1; echo "kill -$sig $pid @${SECONDS}"; kill -$sig $pid; } &

wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"
wait $pid
echo "wait $pid return code $? @${SECONDS}"


For which I get the following example output:
TEST: KILL PRIOR TO wait -n @0
kill -TERM 1384 @0
./test.sh: line 14:  1384 Terminated  { sleep 1; exit 1; }
wait -n 1384 return code 127 @2 (BUG)
wait 1384 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 1402 @3
./test.sh: line 25:  1402 Terminated  { sleep 2; exit 1; }
wait -n 1402 return code 143 @3
wait 1402 return code 143 @3

I expect the line ending (BUG) to indicate a return code of 143.
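
For readers less familiar with these numbers: 143 is 128 + 15, the status bash reports for a child killed by SIGTERM, while 127 is what wait returns when it cannot find the job. A small decoding sketch follows; describe_status is a hypothetical helper, and it relies on bash's `kill -l N' printing the name of signal N.

```bash
#!/usr/bin/env bash
# describe_status STATUS  (hypothetical helper for reading wait results)
describe_status() {
    local status=$1 name
    if [ "$status" -eq 127 ]; then
        echo "127: wait found no such job (or the child itself exited 127)"
    elif [ "$status" -gt 128 ]; then
        # Death by signal N is reported as 128 + N.
        name=$(kill -l "$(( status - 128 ))" 2>/dev/null) || name="unknown"
        echo "killed by signal $name"
    else
        echo "exited with status $status"
    fi
}

describe_status 143
describe_status 1
```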

Thanks,
Steve Pelley