Re: wait -n misses signaled subprocess
On 1/31/24 2:35 PM, Robert Elz wrote: | Not quite. `new' in this sense is the opposite of `anything in the past' | as Dale described it -- already notified and removed from the jobs list. I guess the part about bash that I am not understanding here is how the "already notified" works. To me there are just two ways for that, either the user has done a "wait" which has collected that pid already (either without -n, and no pid args, or with pid args and one of those is the pid in question) or with -n and the pid in question was the one whose status was returned, or the user/script did the jobs command (or jobs -l) and the job in question was shown as completed. Is there some other way? Notification after a job terminates due to a signal in a non-interactive shell -- that runs the equivalent of `jobs'. As it turns out, this was the problem with Steven Pelley's original report. I fixed one issue, but that kind of notification will leave jobs marked as notified and eligible to be removed from the jobs list. | Half the problem here is that bash aggressively marks dead jobs as being | notified in non-interactive shells without job control enabled, and moves | them out of the jobs table. That might be more than half the problem, it might be the entire problem. It seems to be in this case. It's a good thing it's limited to processes that terminate due to signals; a bad thing that processes terminating due to signals was the entire subject of the original report. | but if you | do, or if you use wait -n with pid/job arguments (which you've presumably | saved yourself) you're going to need slightly different semantics than we | have now to answer that reliably. And that will probably need a new option. That's a pity, particularly since the current semantics don't seem to be useful in general. Shoehorning pid/job arguments into the previous behavior, which only dealt with running jobs, resulted in the current semantics. 
I should probably have made `wait -n' with pid arguments look at terminated and notified processes, but I didn't change the `running job' semantics. Hindsight. Since the sole issue provoking that seems to be the wait over and over policy, It's not a policy, per se, it's behavior that has historically worked that way. rather than "wait once, and remove completely" POSIX semantics. perhaps rather than a new, but different, -n like option, a better idea would be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) or -o (once only)) is set, then when the wait with that option returns status or, or waits until termination without returning status (in the not -n case, with no pid args, or many pid args) then the processes are completely deleted from everywhere in the shell. Or you could use posix mode with the recent change, already in devel, since POSIX requires this behavior (but see below). Using that option would make a changed -n safe to use in loops. If you do that, also add an option (maybe the upper case version of whatever is selected for that one, or just some other letter) to mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash would just be a no-op (except in posix mode, apparently - whereas the -[cor] option would be a no-op in posix mode). You're not the only one to suggest some new option(s). Only one really matters for this discussion. If you were to do that, other shells could add the same (except in probably all of them, -[cor] would always be the default, and the other one would be the one which changes behaviour). That's always hit or miss. | > The one change that should be made is | > to allow wait -n to collect processes/jobs that have already terminated. | | Yes, that's one of the things we're talking about. I don't have any problem | with it, but should it take a new option to change those semantics? Good, though I think some more thought should go into that. 
In another thread you said (paraphrasing) correctly, that scripts should not be relying upon bugs, and the current wait -n behaviour is a bug - that it might have been intentionally coded that way doesn't make it any less so. Trust me, there are people on the other side of that question. It isn't as if it was ever documented to work the way it does, or everyone would have known about it already. You mean the behavior of `wait -n' with pid arguments, I presume. The problem with your statement is that people do know about it. The question, as above, is whether or not to avoid changing the behavior because they do. There are two things that we could change: 1. wait -n needs to get access to the list of terminated pids (the ones that satisfy POSIX's "CHILD_MAX processes known in the current shell environment"), like wait without -n does. This can happen via a wait option, a shell option, or a change in behavior controlled by the compatibility level. 2. Some option to implement the
Re: wait -n misses signaled subprocess
On Thu, Feb 1, 2024, 09:09 alex xmb sw ratchev wrote: > > > On Wed, Jan 31, 2024, 20:36 Robert Elz wrote: > >> Date:Wed, 31 Jan 2024 11:35:57 -0500 >> From:Chet Ramey >> Message-ID: <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu> >> >> | Not quite. `new' in this sense is the opposite of `anything in the >> past' >> | as Dale described it -- already notified and removed from the jobs >> list. >> >> I guess the part about bash that I am not understanding here is how the >> "already notified" works. To me there are just two ways for that, either >> the user has done a "wait" which has collected that pid already (either >> without -n, and no pid args, or with pid args and one of those is the pid >> in question) or with -n and the pid in question was the one whose status >> was returned, or the user/script did the jobs command (or jobs -l) and the >> job in question was shown as completed. >> > > i say additional datastructure for the saving purpose .. > it d need new uid , real-unique-id , or some special hash of the jobs/pids/cmdlines Is there some other way? >> >> | Half the problem here is that bash aggressively marks dead jobs as >> being >> | notified in non-interactive shells without job control enabled, and >> moves >> | them out of the jobs table. >> >> That might be more than half the problem, it might be the entire problem. >> >> | If you use wait -n without arguments, you probably don't care, >> >> No you do, that just means any of the children ... the script could make >> a list of all of them and supply that list, but if the list is just going >> to contain all the existing children, why bother?(With -n - and not >> exactly one pid arg, -p is generally going to be required, but that option >> has no bearing on which process is selected, or might be, which is the >> issue here). 
>> >> | but if you >> | do, or if you use wait -n with pid/job arguments (which you've >> presumably >> | saved yourself) you're going to need slightly different semantics >> than we >> | have now to answer that reliably. And that will probably need a new >> option. >> >> That's a pity, particularly since the current semantics don't seem to >> be useful in general. Since the sole issue provoking that seems to be >> the wait over and over policy, rather than "wait once, and remove >> completely" >> perhaps rather than a new, but different, -n like option, a better idea >> would >> be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) >> or -o >> (once only)) is set, then when the wait with that option returns status >> or, >> or waits until termination without returning status (in the not -n case, >> with >> no pid args, or many pid args) then the processes are completely deleted >> from >> everywhere in the shell. Using that option would make a changed -n safe >> to use in loops. If you do that, also add an option (maybe the upper >> case >> version of whatever is selected for that one, or just some other letter) >> to >> mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash >> would >> just be a no-op (except in posix mode, apparently - whereas the -[cor] >> option >> would be a no-op in posix mode). >> >> If you were to do that, other shells could add the same (except in >> probably >> all of them, -[cor] would always be the default, and the other one would >> be >> the one which changes behaviour). >> >> | And that's why I used `more': there are several differences, so which >> | of those differences should we attempt to change? >> >> Just the one. >> >> | > The one change that should be made is >> | > to allow wait -n to collect processes/jobs that have already >> terminated. >> | >> | Yes, that's one of the things we're talking about. 
I don't have any >> problem >> | with it, but should it take a new option to change those semantics? >> >> Good, though I think some more thought should go into that. In another >> thread you said (paraphrasing) correctly, that scripts should not be >> relying upon bugs, and the current wait -n behaviour is a bug - that it >> might have been intentionally coded that way doesn't make it any less so. >> It isn't as if it was ever documented to work the way it does, or everyone >> would have known about it already. >> >> | > Changing it to wait for all the listed pids >> | It's never done that. >> | We're not going to change the return value from wait. >> >> Good, I only mentioned those possibilities because your earlier >> message was unclear about what "more like wait without -n" meant. >> >> | Yeah, but we're talking about bash here. It doesn't really matter what >> | the Bourne shell did; there are likely plenty of scripts that assume >> | the historical bash behavior. >> >> Really? Why? What's the point of collecting the status twice? >> It can't change in the meantime can it, once a process has done exit(N) >> its exit status should always be N,
Re: wait -n misses signaled subprocess
On Wed, Jan 31, 2024, 20:36 Robert Elz wrote: > Date:Wed, 31 Jan 2024 11:35:57 -0500 > From:Chet Ramey > Message-ID: <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu> > > | Not quite. `new' in this sense is the opposite of `anything in the > past' > | as Dale described it -- already notified and removed from the jobs > list. > > I guess the part about bash that I am not understanding here is how the > "already notified" works. To me there are just two ways for that, either > the user has done a "wait" which has collected that pid already (either > without -n, and no pid args, or with pid args and one of those is the pid > in question) or with -n and the pid in question was the one whose status > was returned, or the user/script did the jobs command (or jobs -l) and the > job in question was shown as completed. > i say additional datastructure for the saving purpose .. Is there some other way? > > | Half the problem here is that bash aggressively marks dead jobs as > being > | notified in non-interactive shells without job control enabled, and > moves > | them out of the jobs table. > > That might be more than half the problem, it might be the entire problem. > > | If you use wait -n without arguments, you probably don't care, > > No you do, that just means any of the children ... the script could make > a list of all of them and supply that list, but if the list is just going > to contain all the existing children, why bother?(With -n - and not > exactly one pid arg, -p is generally going to be required, but that option > has no bearing on which process is selected, or might be, which is the > issue here). > > | but if you > | do, or if you use wait -n with pid/job arguments (which you've > presumably > | saved yourself) you're going to need slightly different semantics than > we > | have now to answer that reliably. And that will probably need a new > option. > > That's a pity, particularly since the current semantics don't seem to > be useful in general. 
Since the sole issue provoking that seems to be > the wait over and over policy, rather than "wait once, and remove > completely" > perhaps rather than a new, but different, -n like option, a better idea > would > be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) or > -o > (once only)) is set, then when the wait with that option returns status or, > or waits until termination without returning status (in the not -n case, > with > no pid args, or many pid args) then the processes are completely deleted > from > everywhere in the shell. Using that option would make a changed -n safe > to use in loops. If you do that, also add an option (maybe the upper case > version of whatever is selected for that one, or just some other letter) to > mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash > would > just be a no-op (except in posix mode, apparently - whereas the -[cor] > option > would be a no-op in posix mode). > > If you were to do that, other shells could add the same (except in probably > all of them, -[cor] would always be the default, and the other one would be > the one which changes behaviour). > > | And that's why I used `more': there are several differences, so which > | of those differences should we attempt to change? > > Just the one. > > | > The one change that should be made is > | > to allow wait -n to collect processes/jobs that have already > terminated. > | > | Yes, that's one of the things we're talking about. I don't have any > problem > | with it, but should it take a new option to change those semantics? > > Good, though I think some more thought should go into that. In another > thread you said (paraphrasing) correctly, that scripts should not be > relying upon bugs, and the current wait -n behaviour is a bug - that it > might have been intentionally coded that way doesn't make it any less so. 
> It isn't as if it was ever documented to work the way it does, or everyone > would have known about it already. > > | > Changing it to wait for all the listed pids > | It's never done that. > | We're not going to change the return value from wait. > > Good, I only mentioned those possibilities because your earlier > message was unclear about what "more like wait without -n" meant. > > | Yeah, but we're talking about bash here. It doesn't really matter what > | the Bourne shell did; there are likely plenty of scripts that assume > | the historical bash behavior. > > Really? Why? What's the point of collecting the status twice? > It can't change in the meantime can it, once a process has done exit(N) > its exit status should always be N, regardless of how often it is waited > upon. > > [Aside: this should be obvious, but when one is collecting status changes, > rather than just "terminated" status, then the pid isn't removed if it > returns a "stopped" or "continued" status.] > > | > I meant the distinction
Re: wait -n misses signaled subprocess
Date:Wed, 31 Jan 2024 11:35:57 -0500 From:Chet Ramey Message-ID: <1e50aa99-8d53-4cdf-ba5e-6aaf3ccc6...@case.edu> | Not quite. `new' in this sense is the opposite of `anything in the past' | as Dale described it -- already notified and removed from the jobs list. I guess the part about bash that I am not understanding here is how the "already notified" works. To me there are just two ways for that, either the user has done a "wait" which has collected that pid already (either without -n, and no pid args, or with pid args and one of those is the pid in question) or with -n and the pid in question was the one whose status was returned, or the user/script did the jobs command (or jobs -l) and the job in question was shown as completed. Is there some other way? | Half the problem here is that bash aggressively marks dead jobs as being | notified in non-interactive shells without job control enabled, and moves | them out of the jobs table. That might be more than half the problem, it might be the entire problem. | If you use wait -n without arguments, you probably don't care, No you do, that just means any of the children ... the script could make a list of all of them and supply that list, but if the list is just going to contain all the existing children, why bother?(With -n - and not exactly one pid arg, -p is generally going to be required, but that option has no bearing on which process is selected, or might be, which is the issue here). | but if you | do, or if you use wait -n with pid/job arguments (which you've presumably | saved yourself) you're going to need slightly different semantics than we | have now to answer that reliably. And that will probably need a new option. That's a pity, particularly since the current semantics don't seem to be useful in general. 
Since the sole issue provoking that seems to be the wait over and over policy, rather than "wait once, and remove completely" perhaps rather than a new, but different, -n like option, a better idea would be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) or -o (once only)) is set, then when the wait with that option returns status or, or waits until termination without returning status (in the not -n case, with no pid args, or many pid args) then the processes are completely deleted from everywhere in the shell. Using that option would make a changed -n safe to use in loops. If you do that, also add an option (maybe the upper case version of whatever is selected for that one, or just some other letter) to mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash would just be a no-op (except in posix mode, apparently - whereas the -[cor] option would be a no-op in posix mode). If you were to do that, other shells could add the same (except in probably all of them, -[cor] would always be the default, and the other one would be the one which changes behaviour). | And that's why I used `more': there are several differences, so which | of those differences should we attempt to change? Just the one. | > The one change that should be made is | > to allow wait -n to collect processes/jobs that have already terminated. | | Yes, that's one of the things we're talking about. I don't have any problem | with it, but should it take a new option to change those semantics? Good, though I think some more thought should go into that. In another thread you said (paraphrasing) correctly, that scripts should not be relying upon bugs, and the current wait -n behaviour is a bug - that it might have been intentionally coded that way doesn't make it any less so. It isn't as if it was ever documented to work the way it does, or everyone would have known about it already. | > Changing it to wait for all the listed pids | It's never done that. 
| We're not going to change the return value from wait. Good, I only mentioned those possibilities because your earlier message was unclear about what "more like wait without -n" meant. | Yeah, but we're talking about bash here. It doesn't really matter what | the Bourne shell did; there are likely plenty of scripts that assume | the historical bash behavior. Really? Why? What's the point of collecting the status twice? It can't change in the meantime can it, once a process has done exit(N) its exit status should always be N, regardless of how often it is waited upon. [Aside: this should be obvious, but when one is collecting status changes, rather than just "terminated" status, then the pid isn't removed if it returns a "stopped" or "continued" status.] | > I meant the distinction between processes | > that the shell has already collected status for, and those for which it | You're not the first to propose something like that, but I'm not going to | be writing that code any time soon. Nor am I, if you go back to the message where I first mentioned it, which I can't locate
Re: wait -n misses signaled subprocess
On 1/30/24 12:40 PM, Robert Elz wrote: | since this was the way -n worked orginally, before it started | paying attention to pid arguments. I'm not sure what the "this" is there, if you meant as I described it in my answer to your rhetorical question, viz: Find, or if there are none already, wait*(2) for, [...] If there's already a terminated job [...] then no wait type sys call gets performed then that seems to be in conflict with some of your other statements like: I won't ask you to look at the code, but yes, that's pretty much what it did: polled dead jobs to see if any could be returned because the user had not been notified, then made sure there were actual running background jobs and waited for one of them and returned the first one that exited. chet.ra...@case.edu said (replying to Dale R. Worley): | > It looks like the underlying meaning of "-n" is to only pay attention to | > *new* job completions, and anything "in the past" (already notified and | > moved to the table of terminated background jobs) is ignored. | That was the original implementation, yes. which is a different thing entirely. Not quite. `new' in this sense is the opposite of `anything in the past' as Dale described it -- already notified and removed from the jobs list. Jobs in the jobs list that hadn't been marked as notified were eligible to be returned, because to the user, they're new. Half the problem here is that bash aggressively marks dead jobs as being notified in non-interactive shells without job control enabled, and moves them out of the jobs table. | Right -- it works on the list of running background jobs. I know it is hard, but for determining what should happen, we need to keep thoughts of the current implementation details out of this, as while I'm sure you know exactly what that means, most others will not. It's pretty much the original implementation as I described it above. The running background jobs part kicks in after the `dead but not notified' part. 
What matters (to a script writer) is whether or not the processes listed (if any) have had their status collected before or not - if not, then any process (job) eligible (in the arg list of pids if there is one, or just any) which has returned some status should be returned (if there are multiple, any one of them) and if there are none, then we wait(2) until one does change status. What exactly "Running background jobs" means there is not clear (to me anyway). OK. What's the mechanism by which they find out which processes are in the state where the current version of wait -n will work on them?Assume there are multiple running (or perhaps recently ended) processes, and we want to process each as it ends (or soon after, given multiple might end around the same time). If you use wait -n without arguments, you probably don't care, but if you do, or if you use wait -n with pid/job arguments (which you've presumably saved yourself) you're going to need slightly different semantics than we have now to answer that reliably. And that will probably need a new option. | The real question is whether or not | we should extend `wait -n' to behave more like `wait' without options. That's not an answerable question, as there are several differences between wait -n and wait without -n (which is what I assume you mean by "wait without options"). The bash/posix semantics for `wait' without -n, for which you can ignore -p and -f. And that's why I used `more': there are several differences, so which of those differences should we attempt to change? The one change that should be made is to allow wait -n to collect processes/jobs that have already terminated. Yes, that's one of the things we're talking about. I don't have any problem with it, but should it take a new option to change those semantics? Changing it to wait for all the listed pids (which would make it behave more like wait without -n) is not desirable. It's never done that. 
Nor is changing a simple "wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant) to always exit with status 0 - which is what "wait" does. So, please be clear. We're not going to change the return value from wait. | Why impose that requirement when it's never existed before? Never existed before in what? In bash, perhaps. In standard Bourne shells (and POSIX), this isn't at all new, it has always been required to wait for background processes (or allow the list of saved status to overflow, and old ones to be discarded). Yeah, but we're talking about bash here. It doesn't really matter what the Bourne shell did; there are likely plenty of scripts that assume the historical bash behavior. There was never any implicit "clean up when X happens" which is what bash seems to do (in non-interactive shells, interactive ones clean up before PS1 is written). And? | Bash `wait' already has -f to return only
Re: wait -n misses signaled subprocess
On 1/30/24 4:28 PM, Chet Ramey wrote:

> It's not a bug, bash has allowed multiple waits for the same pid for
> decades. bash works the way posix says it should for wait (without -n)
> in posix mode.

I think this is a bug in bash posix mode, actually. `wait -n' should remove
the job completely, since it's been `successfully waited for' and the
language you quoted came out of interp 1254 and will be in the next revision.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
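The repeated-wait behaviour discussed in this message can be observed with a few lines of bash. This is only a minimal sketch, not code from the thread: the exit value 7 and the variable names are made up for illustration. What the second wait reports is the point under discussion (the saved status again in default mode, going by the statement quoted above; 127 under the POSIX wording cited elsewhere in the thread), so the script prints the value rather than assuming it.

```bash
#!/usr/bin/env bash
# Sketch: wait twice for the same pid and report both statuses.
# The exit value 7 and the variable names are illustrative assumptions.

{ exit 7; } &
pid=$!

# First wait: collects the exit status of the background job.
wait "$pid"
first=$?

# Second wait for a pid that has already been waited for once.
# stderr is discarded in case the shell no longer knows the pid.
wait "$pid" 2>/dev/null
second=$?

echo "first wait:  $first"
echo "second wait: $second"
```

Running the same script under `bash --posix` is one way to compare the two modes on a given bash version.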
Re: wait -n misses signaled subprocess
On 1/30/24 2:30 PM, Robert Elz wrote:

>   | If wait -n
>   | looked at terminated processes you'd return jobs repeatedly and
>   | possibly end up in an infinite loop.
>
> That's another bash bug, POSIX says:

It's not a bug, bash has allowed multiple waits for the same pid for decades.
bash works the way posix says it should for wait (without -n) in posix mode.

> With wait -n, the shell should look to see if any of the process id's
> listed is currently terminated, and if so, return status of one of those
> (and remove it from the lists). If none are terminated, it should look
> to see if any of the pids are for non-terminated jobs (or processes) and
> if so, just do a wait() until some child changes status. If that one is
> one that is in the list being waited for, then return its status (and
> remove it from the lists) otherwise just change the status of that process
> in the lists (including remembering the exit status if that is what this
> was), and wait() again - eventually one of them should change status (that
> or the shell will be interrupted by a signal, ending the wait utility).
> If none of the pids given in the arg list are known to the shell then it
> should return 127.

We can have these different semantics with a new option.

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: wait -n misses signaled subprocess
    Date:        Tue, 30 Jan 2024 10:14:10 -0500
    From:        Steven Pelley
    Message-ID:

  | If wait -n
  | looked at terminated processes you'd return jobs repeatedly and
  | possibly end up in an infinite loop.

That's another bash bug, POSIX says:

     Once a process ID that is known in the current shell execution
     environment (see Section 2.13, on page 2522) has been successfully
     waited for, it shall be removed from the list of process IDs that are
     known in the current shell execution environment. If the process ID is
     associated with a background job, the corresponding job shall also be
     removed from the list of background jobs.

That is, if you wait for the same pid again, then all you can get is a
127 status (that pid is not known, or should not be).

With wait -n, the shell should look to see if any of the process id's
listed is currently terminated, and if so, return status of one of those
(and remove it from the lists). If none are terminated, it should look
to see if any of the pids are for non-terminated jobs (or processes) and
if so, just do a wait() until some child changes status. If that one is
one that is in the list being waited for, then return its status (and
remove it from the lists) otherwise just change the status of that process
in the lists (including remembering the exit status if that is what this
was), and wait() again - eventually one of them should change status (that
or the shell will be interrupted by a signal, ending the wait utility).
If none of the pids given in the arg list are known to the shell then it
should return 127.

Do that, properly, and the loop will always terminate, whether or not
you remove each pid from the list of pending ones as its status is
returned.

bash's habit of holding these things forever is weird, but certainly
explains some of Chet's concerns with list sizes and such.

Incidentally, the example code given is not a good example of the issue.
In that, if the first background sleep is allowed to finish before the
wait -n loop starts, bash still returns its status (achieve that by making
the sleeps longer, except the first, then adding a (fg) sleep 2 into the
script before the loop starts). Whatever condition is required to trigger
the behaviour that is being objected to doesn't occur in that case.

kre
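The variation described in the last paragraph can be sketched roughly as follows. This is a reconstruction under assumptions, not the script that was actually run: the sleep lengths, the exit values and the single wait -n call are illustrative, and `wait -p` needs bash 5.1 or later. Which job the first wait -n reports may depend on the bash version and on how the script is invoked, so the sketch only prints what came back.

```bash
#!/usr/bin/env bash
# Sketch: let the first background job finish before wait -n is first called.
# Sleep lengths and exit values are illustrative assumptions.

declare -A pids
{ sleep 1; exit 1; } &
pids[$!]=""
{ sleep 3; exit 2; } &
pids[$!]=""
{ sleep 4; exit 3; } &
pids[$!]=""

sleep 2    # foreground sleep: the first job exits while this runs

finished_pid=
wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
status=$?
echo "first wait -n returned: ${finished_pid:-none}: $status @${SECONDS}"

wait       # collect the remaining jobs before the script exits
```

If the early job is still eligible, the line printed should name its pid with status 1 at about two seconds; if it has already been dropped from the jobs list, a later job would be reported instead.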
Re: wait -n misses signaled subprocess
Date:Tue, 30 Jan 2024 09:16:47 -0500 From:Chet Ramey Message-ID: <95841ed3-ec4f-4b17-802c-86e560b58...@case.edu> | since this was the way -n worked orginally, before it started | paying attention to pid arguments. I'm not sure what the "this" is there, if you meant as I described it in my answer to your rhetorical question, viz: Find, or if there are none already, wait*(2) for, [...] If there's already a terminated job [...] then no wait type sys call gets performed then that seems to be in conflict with some of your other statements like: chet.ra...@case.edu said (replying to Dale R. Worley): | > It looks like the underlying meaning of "-n" is to only pay attention to | > *new* job completions, and anything "in the past" (already notified and | > moved to the table of terminated background jobs) is ignored. | That was the original implementation, yes. which is a different thing entirely. | Right -- it works on the list of running background jobs. I know it is hard, but for determining what should happen, we need to keep thoughts of the current implementation details out of this, as while I'm sure you know exactly what that means, most others will not. What matters (to a script writer) is whether or not the processes listed (if any) have had their status collected before or not - if not, then any process (job) eligible (in the arg list of pids if there is one, or just any) which has returned some status should be returned (if there are multiple, any one of them) and if there are none, then we wait(2) until one does change status. What exactly "Running background jobs" means there is not clear (to me anyway). But if it were to mean only processes that haven't previously terminated, how is the script writer meant to handle that? 
What's the mechanism by which they find out which processes are in the state where the current version of wait -n will work on them?Assume there are multiple running (or perhaps recently ended) processes, and we want to process each as it ends (or soon after, given multiple might end around the same time). | The real question is whether or not | we should extend `wait -n' to behave more like `wait' without options. That's not an answerable question, as there are several differences between wait -n and wait without -n (which is what I assume you mean by "wait without options"). The one change that should be made is to allow wait -n to collect processes/jobs that have already terminated. Changing it to wait for all the listed pids (which would make it behave more like wait without -n) is not desirable. Nor is changing a simple "wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant) to always exit with status 0 - which is what "wait" does. So, please be clear. | Why impose that requirement when it's never existed before? Never existed before in what? In bash, perhaps. In standard Bourne shells (and POSIX), this isn't at all new, it has always been required to wait for background processes (or allow the list of saved status to overflow, and old ones to be discarded). There was never any implicit "clean up when X happens" which is what bash seems to do (in non-interactive shells, interactive ones clean up before PS1 is written). | Bash `wait' already has -f to return only when the specified job(s) has | terminated, reserving -t for some future use. No, that's what I meant, -f is making the distinction between terminated and some other status change. I meant the distinction between processes that the shell has already collected status for, and those for which it is yet to do so - ie: to add an option more or less equiv to WNOHANG in the wait*(2) sys calls (the ones that have flags). 
The shell could simply never do a wait(2) family sys call when the option is set, or if it does one, to see if there might be a zombie waiting to be reaped, then it should set WNOHANG when it does, to avoid the script from pausing. | There's no reason to keep thousands of terminated jobs in the jobs list, | slowing everything down, as long as you give users a way to retrieve their | status. This is just implementation detail, as long as it behaves correctly, what optimisations the implementation chooses to make are irrelevant. | You can run thousands of background jobs in a loop without exceeding the | max process limit. It depends just what those jobs are. For something like while true; do :& done then yes, sure as the jobs all terminate quite quickly, and as the shell collects the zombies as soon as they become available (more or less) the limit never gets reached. But those kinds of things are rarely useful to anyone except those doing torture tests. More likely would be something like while true; do sleep 1 & done where the "sleep" is just a placeholder for anything meaningful which is going to take appreciable time to complete. In
Re: wait -n misses signaled subprocess
Apologies for a typo:

With the discussed change this would return 44080: 1 in an endless loop.

1, not 0

On Tue, Jan 30, 2024 at 10:14 AM Steven Pelley wrote:
>
> > OK. Can you think of a use case that would break if wait -n looked at
> > terminated processes?
>
> Yes. If one were to start a number of bg jobs and repeatedly send the
> list of pids to wait -n (probably redirecting stderr to /dev/null to
> ignore messages about unknown jobs) today you'd process the jobs one
> at a time, assuming no races between job completion. If wait -n
> looked at terminated processes you'd return jobs repeatedly and
> possibly end up in an infinite loop.
>
> Example:
> # associative array used for consistency with later example
> declare -A pids
> { sleep 1; exit 1; } &
> pids[$!]=""
> { sleep 2; exit 2; } &
> pids[$!]=""
> { sleep 3; exit 3; } &
> pids[$!]=""
>
> status=0
> while [ $status -ne 127 ]; do
>   unset finished_pid
>   wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
>   status=$?
>   if [ -n "$finished_pid" ]; then
>     echo "$finished_pid: $status @${SECONDS}"
>   fi;
> done
>
> gives a simple output like:
> 44080: 1 @1
> 44081: 2 @2
> 44083: 3 @3
>
> With the discussed change this would return 44080: 0 in an endless loop.
> It would need to change to:
>
> while [ ${#pids[@]} -ne 0 ]; do
>   unset finished_pid
>   wait -n -p finished_pid "${!pids[@]}"
>   status=$?
>   if [ -n "$finished_pid" ]; then
>     echo "$finished_pid: $status @${SECONDS}"
>   fi;
>   unset pids[$finished_pid]
> done
>
> Where the returned pid is unset in the array. I like this more but it
> will break scripts that run correctly today.
Re: wait -n misses signaled subprocess
On 1/30/24 10:14 AM, Steven Pelley wrote:

>> OK. Can you think of a use case that would break if wait -n looked at
>> terminated processes?
>
> Yes. If one were to start a number of bg jobs and repeatedly send the
> list of pids to wait -n (probably redirecting stderr to /dev/null to
> ignore messages about unknown jobs) today you'd process the jobs one
> at a time, assuming no races between job completion. If wait -n
> looked at terminated processes you'd return jobs repeatedly and
> possibly end up in an infinite loop.

OK, that argues for a new option to provide this functionality.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: wait -n misses signaled subprocess
> OK. Can you think of a use case that would break if wait -n looked at
> terminated processes?

Yes. If one were to start a number of bg jobs and repeatedly send the list of pids to wait -n (probably redirecting stderr to /dev/null to ignore messages about unknown jobs) today you'd process the jobs one at a time, assuming no races between job completion. If wait -n looked at terminated processes you'd return jobs repeatedly and possibly end up in an infinite loop.

Example:

# associative array used for consistency with later example
declare -A pids
{ sleep 1; exit 1; } &
pids[$!]=""
{ sleep 2; exit 2; } &
pids[$!]=""
{ sleep 3; exit 3; } &
pids[$!]=""

status=0
while [ $status -ne 127 ]; do
  unset finished_pid
  wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
  status=$?
  if [ -n "$finished_pid" ]; then
    echo "$finished_pid: $status @${SECONDS}"
  fi;
done

gives a simple output like:

44080: 1 @1
44081: 2 @2
44083: 3 @3

With the discussed change this would return 44080: 0 in an endless loop. It would need to change to:

while [ ${#pids[@]} -ne 0 ]; do
  unset finished_pid
  wait -n -p finished_pid "${!pids[@]}"
  status=$?
  if [ -n "$finished_pid" ]; then
    echo "$finished_pid: $status @${SECONDS}"
  fi;
  unset pids[$finished_pid]
done

Where the returned pid is unset in the array. I like this more but it will break scripts that run correctly today.
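The second loop in this message stalls if `wait -n` ever comes back without setting finished_pid (for example with status 127), because nothing is then removed from the array. What follows is a minimal sketch of a more defensive variant, assuming bash 5.1 or later for `wait -n -p`. The `results` array and the fallback to plain `wait pid` (which, as noted elsewhere in the thread, does consult saved statuses) are illustrative additions rather than anything proposed by the participants, and the sleeps are shortened so the script finishes quickly.

```shell
#!/usr/bin/env bash
# Requires bash >= 5.1 (wait -n with pid arguments and -p).
declare -A pids=()      # outstanding PIDs, stored as keys
declare -A results=()   # PID -> exit status, filled in as jobs are reaped

for n in 1 2 3; do
    { sleep "0.$n"; exit "$n"; } &
    pids[$!]=""
done

while [ "${#pids[@]}" -ne 0 ]; do
    finished_pid=""
    wait -n -p finished_pid "${!pids[@]}" 2>/dev/null
    status=$?
    if [ -z "$finished_pid" ]; then
        # wait -n reported no PID: none of the remaining jobs is visible
        # to it. Fall back to plain `wait pid` for each leftover PID.
        for pid in "${!pids[@]}"; do
            wait "$pid" 2>/dev/null
            results[$pid]=$?
            unset "pids[$pid]"
        done
        break
    fi
    results[$finished_pid]=$status
    unset "pids[$finished_pid]"
done

for pid in "${!results[@]}"; do
    echo "$pid: ${results[$pid]}"
done
```

Because the script removes each PID itself, the loop ends whether or not `wait -n` considers already-terminated processes.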
Re: wait -n misses signaled subprocess
On 1/30/24 9:11 AM, Steven Pelley wrote: It does look in the table of saved exit statuses, returning 1. It doesn't. In this case, the code path it follows marks the job as dead but doesn't mark it as notified (since it exited normally), so it's still in the jobs list when `wait -n' is called, and available for returning. That's probably a bug there. Got it. So wait -n is intended to behave just as the documentation says -- "next" job -- and if there's a bug it's with how normally-exiting processes are handled, not signal-exiting processes. Thank you for your patience. This has raised several other questions: whether `wait -n' should work more like `wait' (see below) and whether non-interactive shells without job control enabled should be so aggressive at marking jobs as notified, since it's that state that allows them to move to the list of terminated processes. There's also an interaction in that "wait" will only look at the terminated table if "-n" is not specified *and* ids are specified. This is to maintain POSIX semantics, with extensions. This is one of the issues -- should `wait -n' with arguments look for terminated processes in that table, the way `wait' without options does? Yes, I do want wait -n to look in the terminated table, at least for my use case responding to jobs finishing, one at a time, as soon as possible. OK. Can you think of a use case that would break if wait -n looked at terminated processes? I _don't_ want bash to maintain some sort of internal state about which jobs have and haven't been returned by wait -n, which would be complicated and brittle (this is what my mental model was). I'd want it to look in the terminated table for finished jobs amongst the provided list of pids, and then I'd manage the list of pids myself, removing pids that were previously returned from wait -n. 
This is a change in semantics and might introduce inconsistencies and difficulty implementing, I'm just describing what I think would be useful for my specific needs. It's not difficult to implement. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ OpenPGP_signature.asc Description: OpenPGP digital signature
Re: wait -n misses signaled subprocess
On 1/29/24 3:49 PM, Robert Elz wrote: Date:Mon, 29 Jan 2024 12:07:53 -0500 From:Chet Ramey Message-ID: | What does `wait -n' without job arguments mean? Find, or if there are none already, wait*(2) for, a process (job technically) that has changed state (terminated in POSIX, and one day in the NetBSD shell, that difference isn't relevant here) and return its status. If there's already a terminated job (job which has changed status in bash) then no wait type sys call gets performed (that already happened). That was mostly a rhetorical question, since this was the way -n worked orginally, before it started paying attention to pid arguments. Implicit here is the notification/changed state issue we've been discussing. It also returns the status of that process, rather than simple "0" which a bare "wait" does (and with the appropriate arg, tells you which process it was). Not originally, but -p var was a useful addition. | OK. Since wait without options can already wait for the same pid multiple | times, the -n option has to bring some new functionality here. Yes, without args, it waits until all listed arg processes (jobs) are finished (or changed state) and returns the status of the last. With -n it waits for any one of them, just as the bash man page says it will. The "any one" (vs "all") is the new functionality. Right -- it works on the list of running background jobs. | As long as it's still in the jobs list. Yes, of course - the final para of my message covered that case. | OK. We can agree there shouldn't be any difference between `wait pid' | and `wait -n pid'. Yes, but just because that's a degenerate case of the more general commands, which happens in each case to devolve into the same thing. Add more `pid' arguments, if you like. The real question is whether or not we should extend `wait -n' to behave more like `wait' without options. 
And from a different message: chet.ra...@case.edu said: | So should the shell require the user to periodically run `wait' in a non- | interactive shell without job control to clean dead jobs out of the jobs | list? I don't think so. I do. wait or jobs ("jobs >/dev/null" is a nice simple clean up, without the potential hang waiting for things to terminate that the wait utility imposes). Why impose that requirement when it's never existed before? If you want to do it, go ahead, but we shouldn't be making that a requirement now. A new option to wait(1) (either a simple one, perhaps -t, to only wait for already terminated jobs, Bash `wait' already has -f to return only when the specified job(s) has terminated, reserving -t for some future use. Of course, you're also allowed to dump processes from the lists if there get to be too many of them, but on modern systems, it really should be possible to retain hundreds, if not thousands, without any real problem. There's no reason to keep thousands of terminated jobs in the jobs list, slowing everything down, as long as you give users a way to retrieve their status. It's also a bit unusual for non-interactive code to run lots of async jobs without waiting for results - doing that is a sure way to run into the "max user processes" limit, and have things start failing. You can run thousands of background jobs in a loop without exceeding the max process limit. People doing that is what got us here in the first place. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ OpenPGP_signature.asc Description: OpenPGP digital signature
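One common way to keep such a loop under the process limit is to cap the number of running jobs and call `wait -n` with no arguments whenever the cap is reached. The sketch below assumes bash 4.3 or later; `max_jobs`, the count of 12 jobs, and the `sleep 0.1` placeholder are illustrative choices, not values taken from the thread.

```shell
#!/usr/bin/env bash
# Requires bash >= 4.3 (wait -n).
max_jobs=4
launched=0

for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    # Block while the pool is full; jobs -rp prints the PIDs of running jobs.
    while [ "$(( $(jobs -rp | wc -l) ))" -ge "$max_jobs" ]; do
        wait -n 2>/dev/null   # returns once some background job has finished
    done
    sleep 0.1 &               # placeholder for real work
    launched=$((launched + 1))
done

wait                          # collect whatever is still running
remaining=$(( $(jobs -rp | wc -l) ))
echo "launched=$launched remaining=$remaining"
```

The loop re-counts running jobs after every `wait -n`, so it does not depend on which particular job `wait -n` reports.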
Re: wait -n misses signaled subprocess
> > It does look in the table of saved exit statuses, returning 1. > > It doesn't. In this case, the code path it follows marks the job as dead > but doesn't mark it as notified (since it exited normally), so it's still > in the jobs list when `wait -n' is called, and available for returning. > That's probably a bug there. Got it. So wait -n is intended to behave just as the documentation says -- "next" job -- and if there's a bug it's with how normally-exiting processes are handled, not signal-exiting processes. Thank you for your patience. > > There's also an interaction in that "wait" will only look at the > > terminated table if "-n" is not specified *and* ids are specified. > > This is to maintain POSIX semantics, with extensions. This is one of the > issues -- should `wait -n' with arguments look for terminated processes > in that table, the way `wait' without options does? Yes, I do want wait -n to look in the terminated table, at least for my use case responding to jobs finishing, one at a time, as soon as possible. I don't think wait -n can reliably do this since there is always a race between a job finishing/being handled, the next job finishing, and the subsequent call to wait -n. Even if I query "jobs" to see if multiple jobs have terminated, the next finishing job could still race. You've pointed out clearly that my mental model of wait -n was wrong so please bear with me if I still don't have this right. Is there some other best practice for this use case? It might be "use a SIGCHLD handler and query jobs to see what jobs have terminated, then call wait on each" or "I don't recommend using bash/sh for this." Obviously I could also be overlooking some aspect of wait -n or other bash features that would help here. I _don't_ want bash to maintain some sort of internal state about which jobs have and haven't been returned by wait -n, which would be complicated and brittle (this is what my mental model was). 
I'd want it to look in the terminated table for finished jobs amongst the provided list of pids, and then I'd manage the list of pids myself, removing pids that were previously returned from wait -n. This is a change in semantics and might introduce inconsistencies and difficulty implementing, I'm just describing what I think would be useful for my specific needs. A bit of brainstorming: between Linux's pidfds and BSD's kqueue/process descriptors one ought to be able to build this as an external command that polls for non-child processes to terminate. It couldn't return an exit status, but it could at least indicate which process finished or couldn't be found and thus had already finished. Then you could use posix "wait " to get the exit status and be guaranteed that it wouldn't block (a simple timeout option to wait might be useful here for cases where bash's child process may not be visible to an external command). I'm not aware of anything like this existing, but it would be a nice way to separate this functionality from the shell, reduce the number of options in wait, and support other shells. Again, thanks for your patience Chet, Steve
Re: wait -n misses signaled subprocess
On 1/28/24 10:26 PM, Dale R. Worley wrote: Chet Ramey writes: echo "wait -n $pid return code $? @${SECONDS} (BUG)" The job isn't in the jobs table because you've already been notified about it and it's not `new', you get the unknown job error status. The man page gives a lot of details and I'm trying to digest them into a structure. It looks like the underlying meaning of "-n" is to only pay attention to *new* job completions, and anything "in the past" (already notified and moved to the table of terminated background jobs) is ignored. That was the original implementation, yes. The idea was to add something to augment the `wait for all' strategy of wait without pid arguments. The underlying meaning of providing one or more ids is that "wait" is to only be concerned with those jobs. Right, that's the current operation. The man page doesn't make clear that if you don't specify "-n" and do supply ids and one of them has already terminated, you'll get its status (from the terminated table); the wording suggests that "wait" will always *wait for* a termination. Only if your mental model of the operation links the wait builtin and wait system call. If the pid has already terminated, the wait is immediate, and there's no reason to call the system call, but wait still returns the status. (As an aside, that is one long paragraph describing `wait'. I need to break that up.) There's also an interaction in that "wait" will only look at the terminated table if "-n" is not specified *and* ids are specified. This is to maintain POSIX semantics, with extensions. This is one of the issues -- should `wait -n' with arguments look for terminated processes in that table, the way `wait' without options does? Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ OpenPGP_signature.asc Description: OpenPGP digital signature
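The point about `wait pid` returning a saved status without blocking can be checked with a small script. A minimal sketch, assuming a non-interactive bash; the exit code 7 and the variable name `saved_status` are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
( exit 7 ) &
pid=$!
sleep 0.3            # the child is long dead by the time we ask about it

wait "$pid"          # no blocking needed: the status was saved by the shell
saved_status=$?
echo "wait $pid returned $saved_status"
```

The status comes back as 7 even though the child exited well before `wait` ran.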
Re: wait -n misses signaled subprocess
Date:Mon, 29 Jan 2024 12:07:53 -0500 From:Chet Ramey Message-ID: | What does `wait -n' without job arguments mean? Find, or if there are none already, wait*(2) for, a process (job technically) that has changed state (terminated in POSIX, and one day in the NetBSD shell, that difference isn't relevant here) and return its status. If there's already a terminated job (job which has changed status in bash) then no wait type sys call gets performed (that already happened). It also returns the status of that process, rather than simple "0" which a bare "wait" does (and with the appropriate arg, tells you which process it was). | OK. Since wait without options can already wait for the same pid multiple | times, the -n option has to bring some new functionality here. Yes, without args, it waits until all listed arg processes (jobs) are finished (or changed state) and returns the status of the last. With -n it waits for any one of them, just as the bash man page says it will. The "any one" (vs "all") is the new functionality. | As long as it's still in the jobs list. Yes, of course - the final para of my message covered that case. | OK. We can agree there shouldn't be any difference between `wait pid' | and `wait -n pid'. Yes, but just because that's a degenerate case of the more general commands, which happens in each case to devolve into the same thing. And from a different message: chet.ra...@case.edu said: | So should the shell require the user to periodically run `wait' in a non- | interactive shell without job control to clean dead jobs out of the jobs | list? I don't think so. I do. wait or jobs ("jobs >/dev/null" is a nice simple clean up, without the potential hang waiting for things to terminate that the wait utility imposes). 
A new option to wait(1) (either a simple one, perhaps -t, to only wait for already terminated jobs, or a timeout, where 0 indicates never to wait at all (ie: don't do the wait sys call) which would be a more general, but more costly, mechanism). But as long as it is just a matter of cleaning up, and jobs works for that, I don't currently see the need. Of course, you're also allowed to dump processes from the lists if there get to be too many of them, but on modern systems, it really should be possible to retain hundreds, if not thousands, without any real problem. And of course, you're not required to retain status of any job if there's no way that the script can request it - but determining that these days is difficult. It used to be easy in the Sys V/POSIX model where if $! wasn't saved, then there was no way for the script to request the status, as it couldn't (reasonably - parsing job trees from ps output doesn't count) find out the pid to wait for (and simple "wait" never returns any status). These days, with the jobs command available, a script could do pids=$(jobs -l | code to parse the output and print the pids) and determine what it can wait for that way (the code isn't difficult) - and it can also wait on %1 %2 ... without having any idea what the pids might be, so in practice adding the (non-trivial) code to monitor references to $! isn't worth the bother (IMO). It's also a bit unusual for non-interactive code to run lots of async jobs without waiting for results - doing that is a sure way to run into the "max user processes" limit, and have things start failing. If there are less than that, then having the shell retain the info until the script terminates isn't really a very big cost, should the script not bother to ever clean up. | I think it's whether or not `wait -n pid' behaves the same as `wait pid' and | looks in the list of saved exit statuses if the pid isn't found in a job in | the jobs list. 
We have it simpler than that, there's just one list, which serves both purposes. Makes things easier I believe (in all three of: shell code, shell doc, and user understanding), even if it does consume a few more bytes for a little longer than is really needed (jobs needs the command strings, so they can be printed, wait doesn't, so retaining that is an extra cost ... not one large enough for anyone to have ever noticed though). kre
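For the pid-collecting idea, POSIX already defines `jobs -p`, which prints only process IDs, so in many cases no parsing of `jobs -l` output is needed. A minimal sketch, assuming bash; the `statuses` array and the exit codes 3 and 4 are illustrative.

```shell
#!/usr/bin/env bash
( sleep 0.2; exit 3 ) &
( sleep 0.2; exit 4 ) &

pids=$(jobs -p)          # one PID per job, one per line
statuses=()
for pid in $pids; do     # unquoted on purpose: split on whitespace
    wait "$pid"
    statuses+=("$pid:$?")
done
printf '%s\n' "${statuses[@]}"
```

This only sees jobs still present in the jobs list at the moment `jobs -p` runs, which is the limitation the surrounding discussion is about.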
Re: wait -n misses signaled subprocess
On 1/29/24 7:54 AM, Andreas Schwab wrote:

> On Jan 29 2024, Robert Elz wrote:
>
>> I always wondered why the option was 'n'
>
> n = next?

Yes: the original implementation polled the non-terminated background jobs and returned when one of them exited.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: wait -n misses signaled subprocess
On 1/29/24 12:33 PM, Chet Ramey wrote:

>> You should have. You told me about your implementation using `-n' in
>> 10/2017, long before I implemented it (4/2020).
>
> Sorry, this is my mistake. That was a different feature. Bash implemented
> `wait -n' first.

For those wondering, the `different feature' was having `wait -n' pay attention to its pid/job arguments.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: wait -n misses signaled subprocess
On 1/29/24 12:07 PM, Chet Ramey wrote: On 1/29/24 7:12 AM, Robert Elz wrote: Date: Sun, 28 Jan 2024 18:21:42 -0500 From: Chet Ramey Message-ID: <3347f790-529b-4bee-91fd-de39bed3f...@case.edu> | because `wait -n' doesn't look in the table | of saved statuses -- its job is to wait for `new' jobs to terminate, not | ones that have already been removed from the table. That's very interesting, and most unexpected information. I always wondered why the option was 'n' - I would have made it be 'a' probably, as a shorthand for "any" - but then I decided that perhaps 'n' was a better choice, as "a" could also be "all", the option name would not be providing any real clue at all, so I assumed you'd been ultra clever and used 'n' as the next char in "any" and also as it can be read like the first part of "en" "ee" (which you need to say out loud, or at least in your head, to get the effect of). You should have. You told me about your implementation using `-n' in 10/2017, long before I implemented it (4/2020). Sorry, this is my mistake. That was a different feature. Bash implemented `wait -n' first. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ OpenPGP_signature.asc Description: OpenPGP digital signature
Re: wait -n misses signaled subprocess
On 1/29/24 7:12 AM, Robert Elz wrote: Date:Sun, 28 Jan 2024 18:21:42 -0500 From:Chet Ramey Message-ID: <3347f790-529b-4bee-91fd-de39bed3f...@case.edu> | because `wait -n' doesn't look in the table | of saved statuses -- its job is to wait for `new' jobs to terminate, not | ones that have already been removed from the table. That's very interesting, and most unexpected information. I always wondered why the option was 'n' - I would have made it be 'a' probably, as a shorthand for "any" - but then I decided that perhaps 'n' was a better choice, as "a" could also be "all", the option name would not be providing any real clue at all, so I assumed you'd been ultra clever and used 'n' as the next char in "any" and also as it can be read like the first part of "en" "ee" (which you need to say out loud, or at least in your head, to get the effect of). You should have. You told me about your implementation using `-n' in 10/2017, long before I implemented it (4/2020). It never even dawned on me that 'n' might mean "new", as in only processes that hadn't terminated at the time the wait -n was done, as that's essentially a recipe for script madness, race conditions galore, as the one reported here. What does `wait -n' without job arguments mean? What wait(1) needed was an alternative to its normal "all" semantic, just "wait" waits for every background job to terminate, what's needed is a way to wait for any one of them (whether already terminated, but not previously waited for or not). That's what I always assumed wait -n was doing, and how I implemented it in the NetBSD shell. OK. Since wait without options can already wait for the same pid multiple times, the -n option has to bring some new functionality here. Similarly "wait pid1 pid2 pid3" waits for all 3 of those to terminate, so "wait -n pid1 pid2 pid3" should wait for any one of them - already terminated or not. As long as it's still in the jobs list. 
When there's just one pid in the list, the -n option always seemed useless to me, there ought be no difference between "wait pid" and "wait -n pid" (as in wait for all of one, and wait for any of one, mean the same thing, wait for that one), but obviously should still be supported for consistency. OK. We can agree there shouldn't be any difference between `wait pid' and `wait -n pid'. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ OpenPGP_signature.asc Description: OpenPGP digital signature
Re: wait -n misses signaled subprocess
On 1/28/24 7:19 PM, Steven Pelley wrote: Thank you Chet for your thorough reply. You make a few comments about differences in output (stderr for not finding a job, notifications for jobs terminating) and in all cases I believe you are correct. Let's assume job control is disabled. OK, but remember: "When job control isn't enabled (usually in a non-interactive shell), the shell doesn't notify users about terminated background jobs, but it still removes dead jobs from the jobs list before reading the next command. It cleans the jobs table of notified jobs at other times, too, to move dead jobs out of the jobs list and keep it a manageable size." These exit statuses are still available to `wait pid' (but not `wait -n pid') as POSIX specfies. I expect the line ending (BUG) to indicate a return code of 143. It might, if `wait -n' looked for already-notified jobs in the table of saved exit statuses, but it doesn't. Should it, even if the user has already been notified of the status of that job? When job control is disabled I get this output for the same test (just for consistent reference): The results are consistent with what I described previously. There's no user notification of the job terminating because job control is disabled. The "wait -n" returning 127 is the first opportunity the shell might have to notify the user of the job. So should the shell require the user to periodically run `wait' in a non- interactive shell without job control to clean dead jobs out of the jobs list? I don't think so. In this context I think that "even if the user has already been notified of the status of that job" doesn't apply -- the user hasn't been notified of the job terminating. See above. Even so, this behavior differs from a similar example but where the first job ends successfully, or at least without being killed by a signal. It still terminates prior to calling "wait -n" (this is from Jan 24 but I'll post again to keep everything in a linear thread). 
> echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
> { sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
> pid=$!
> echo "child proc $pid @${SECONDS}"
> sleep 2
> wait -n $pid
> echo "wait -n $pid return code $? @${SECONDS}"
>
> output (no job control):
> TEST: EXIT 0 PRIOR TO wait -n @0
> child proc 2779 @0
> child finishing @1
> wait -n 2779 return code 1 @2
>
> It does look in the table of saved exit statuses, returning 1.

It doesn't. In this case, the code path it follows marks the job as dead but doesn't mark it as notified (since it exited normally), so it's still in the jobs list when `wait -n' is called, and available for returning. That's probably a bug there.

> I think the sticking point is the notion of the user being notified of
> the status of a job.

I think it's whether or not `wait -n pid' behaves the same as `wait pid' and looks in the list of saved exit statuses if the pid isn't found in a job in the jobs list.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
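The signal case from the original report can be probed the same way. This is a sketch under assumptions: bash 5.1 or later (for `wait -n` with a pid argument), job control off, and an illustrative `verdict` variable. Which branch is taken depends on the bash version in use, so no particular output is claimed here.

```shell
#!/usr/bin/env bash
# Requires bash >= 5.1 (wait -n with a pid argument).
sleep 30 &
pid=$!
kill -TERM "$pid"
sleep 0.3                     # give the shell time to notice the death

wait -n "$pid" 2>/dev/null
rc=$?
case $rc in
    143) verdict="status still available (128 + 15 for SIGTERM)" ;;
    127) verdict="job already notified, unknown to wait -n" ;;
    *)   verdict="unexpected status" ;;
esac
echo "wait -n $pid returned $rc: $verdict"
```

Replacing `wait -n "$pid"` with plain `wait "$pid"` gives a point of comparison, since that form is the one described in this thread as consulting the saved statuses.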
Re: wait -n misses signaled subprocess
On Mon, Jan 29, 2024 at 08:52:37PM +0700, Robert Elz wrote:
>     Date:        Mon, 29 Jan 2024 13:54:10 +0100
>     From:        Andreas Schwab
>     Message-ID:
>
>   | n = next?

This was my assumption as well.

> That would be a reasonable interpretation, I guess, but
> unfortunately not one which helps the current question,
> as it doesn't answer "next what?"

For the record, with bash 5.2:

unicorn:~$ cat foo
#!/bin/bash
sleep 1 &
sleep 37 &
sleep 2
time wait -n
unicorn:~$ ./foo
real 0.001 user 0.000 sys 0.001
unicorn:~$ ps
    PID TTY          TIME CMD
   1152 pts/3    00:00:00 bash
 542197 pts/3    00:00:00 sleep
 542200 pts/3    00:00:00 ps
unicorn:~$ ps -fp 542197
UID          PID    PPID  C STIME TTY          TIME CMD
greg      542197       1  0 08:59 pts/3    00:00:00 sleep 37

wait -n *does* appear to acknowledge the already-terminated child process, despite a second child process still being active.
Re: wait -n misses signaled subprocess
Date:Mon, 29 Jan 2024 13:54:10 +0100 From:Andreas Schwab Message-ID: | n = next? That would be a reasonable interpretation, I guess, but unfortunately not one which helps the current question, as it doesn't answer "next what?" It could be "the next of these processes which terminates" (like the "new" interpretation) or "the next of these processes that has a status available" (like the "any" interpretation). While I'm here, I will also mention that the bash man page section for wait(1) does say "any" in one place, and equivalent (but better) wording in another ("a single job"), but never mentions "new" anywhere. Further in both the -n and no -n cases, the wait utility is stated to "wait for" (whatever is appropriate for the args given) hence the operation should be assumed to be the same in both cases, either an actual pause is required in both (until some appropriate process changes status) or is not required in either (if such a process has already terminated and is waiting for shell level reaping). Note that processes that have already been reported (via wait, or jobs, or the prompt level jobs lookalike) have already been reported, so if any of that had happened wait isn't expected to be able to fetch status from them again. kre
Re: wait -n misses signaled subprocess
On Jan 29 2024, Robert Elz wrote:

> I always wondered why the option was 'n'

n = next?

--
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Re: wait -n misses signaled subprocess
Date: Sun, 28 Jan 2024 18:21:42 -0500
From: Chet Ramey
Message-ID: <3347f790-529b-4bee-91fd-de39bed3f...@case.edu>

| because `wait -n' doesn't look in the table
| of saved statuses -- its job is to wait for `new' jobs to terminate, not
| ones that have already been removed from the table.

That's very interesting, and most unexpected information.

I always wondered why the option was 'n' - I would have made it be 'a' probably, as a shorthand for "any" - but then I decided that perhaps 'n' was a better choice, as "a" could also be "all", the option name would not be providing any real clue at all, so I assumed you'd been ultra clever and used 'n' as the next char in "any" and also as it can be read like the first part of "en" "ee" (which you need to say out loud, or at least in your head, to get the effect of).

It never even dawned on me that 'n' might mean "new", as in only processes that hadn't terminated at the time the wait -n was done, as that's essentially a recipe for script madness, race conditions galore, as the one reported here.

What wait(1) needed was an alternative to its normal "all" semantic, just "wait" waits for every background job to terminate, what's needed is a way to wait for any one of them (whether already terminated, but not previously waited for or not). That's what I always assumed wait -n was doing, and how I implemented it in the NetBSD shell.

Similarly "wait pid1 pid2 pid3" waits for all 3 of those to terminate, so "wait -n pid1 pid2 pid3" should wait for any one of them - already terminated or not.

When there's just one pid in the list, the -n option always seemed useless to me, there ought be no difference between "wait pid" and "wait -n pid" (as in wait for all of one, and wait for any of one, mean the same thing, wait for that one), but obviously should still be supported for consistency.

To think that it might be interpreted as "wait for a new process "pid" to terminate, ignoring the one that just finished a few milliseconds ago" is simply astounding, completely unbelievable. And from what I have seen of the other comments, several from long term & dedicated bash users, it is just as astounding to them as well.

Please treat this as a bug, and fix it. Quickly.

kre
Re: wait -n misses signaled subprocess
On Monday, January 29, 2024, Greg Wooledge wrote:

> Anyway... a script writer who has a basic familiarity with wait(2) and
> who reads about "wait -n" will probably assume that wait -n will return
> immediately if a child process has already terminated and hasn't been
> "pseudo-reaped" by a previous "wait" command yet. If three children
> have terminated, then the next three "wait -n" commands should return
> immediately, and the fourth should block (assuming a fourth child exists).

This is the case with me. There is no point in having `wait -n' if it can't distinguish a single job terminating from multiple jobs terminating between subsequent calls.

-- 
Oğuz
Re: wait -n misses signaled subprocess
On Sun, Jan 28, 2024 at 10:26:27PM -0500, Dale R. Worley wrote: > The man page doesn't make clear that if you don't specify "-n" and do > supply ids and one of them has already terminated, you'll get its status > (from the terminated table); the wording suggests that "wait" will > always *wait for* a termination. This might be a result of C programmers who already know the semantics of wait(2) writing documentation which assumes the reader *also* knows these semantics. wait(2) and its brethren return immediately if the process in question has already terminated. It's how you reap the zombie and free up the process table slot, while also retrieving its exit status. If it's not already dead, then wait(2) blocks until death occurs. The shell's "wait" command is meant to mimic this behavior, at its core. There are some differences, however -- notably, the shell aggressively reaps zombies and stores their exit statuses in memory, revealing them to you in the event that you call "wait". Normally this change is invisible, but if you were *counting* on the zombie to be there, holding on to that PID, preventing it from being reused until you could observe the death and react to it, then you're screwed. Don't use the shell for this. Anyway... a script writer who has a basic familiarity with wait(2) and who reads about "wait -n" will probably assume that wait -n will return immediately if a child process has already terminated and hasn't been "pseudo-reaped" by a previous "wait" command yet. If three children have terminated, then the next three "wait -n" commands should return immediately, and the fourth should block (assuming a fourth child exists).
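A minimal sketch of the saved-status behavior described above, assuming bash on Linux, where SIGTERM is signal 15 and a child killed by it is reported as 128 + 15 = 143. The child is killed and reaped well before `wait` runs, yet plain `wait "$pid"` still reports its status. The `sleep` durations are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Start a background child and remember its pid.
sleep 30 &
pid=$!

# Kill it, then give the shell time to reap it behind the scenes.
kill -TERM "$pid"
sleep 1

# The child is long gone, but the shell saved its exit status.
# The `|| status=$?` form keeps this safe under `set -e`.
status=0
wait "$pid" || status=$?
echo "wait $pid returned $status"
```

This only shows that the status survives; it says nothing about whether the PID itself has been reused in the meantime, which is the caveat raised above.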
Re: wait -n misses signaled subprocess
Chet Ramey writes: >> echo "wait -n $pid return code $? @${SECONDS} (BUG)" > > The job isn't in the jobs table because you've already been notified about > it and it's not `new', you get the unknown job error status. The man page gives a lot of details and I'm trying to digest them into a structure. It looks like the underlying meaning of "-n" is to only pay attention to *new* job completions, and anything "in the past" (already notified and moved to the table of terminated background jobs) is ignored. The underlying meaning of providing one or more ids is that "wait" is to only be concerned with those jobs. The man page doesn't make clear that if you don't specify "-n" and do supply ids and one of them has already terminated, you'll get its status (from the terminated table); the wording suggests that "wait" will always *wait for* a termination. There's also an interaction in that "wait" will only look at the terminated table if "-n" is not specified *and* ids are specified. Am I understanding this correctly? Dale
Re: wait -n misses signaled subprocess
Thank you Chet for your thorough reply. You make a few comments about differences in output (stderr for not finding a job, notifications for jobs terminating) and in all cases I believe you are correct. Let's assume job control is disabled.

> > I expect the line ending (BUG) to indicate a return code of 143.
>
> It might, if `wait -n' looked for already-notified jobs in the table of
> saved exit statuses, but it doesn't. Should it, even if the user has
> already been notified of the status of that job?

When job control is disabled I get this output for the same test (just for consistent reference):

TEST: KILL PRIOR TO wait -n @0
kill -TERM 526 @0
./test.sh: line 13: wait: 526: no such job
wait -n 526 return code 127 @2 (BUG)
wait 526 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 544 @3
wait -n 544 return code 143 @3
wait 544 return code 143 @3

There's no user notification of the job terminating because job control is disabled. The "wait -n" returning 127 is the first opportunity the shell might have to notify the user of the job. In this context I think that "even if the user has already been notified of the status of that job" doesn't apply -- the user hasn't been notified of the job terminating. It's possible you are saying that the user was notified of the job's termination in some other way that I missed, so please tell me if I'm misunderstanding this part.

Even so, this behavior differs from a similar example but where the first job ends successfully, or at least without being killed by a signal. It still terminates prior to calling "wait -n" (this is from Jan 24 but I'll post again to keep everything in a linear thread).

echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"
sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"

output (no job control):

TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2779 @0
child finishing @1
wait -n 2779 return code 1 @2

It does look in the table of saved exit statuses, returning 1.

I think the sticking point is the notion of the user being notified of the status of a job. In these examples I don't see that the user is notified prior to the first call to "wait -n," and so I think that this call should notify the user. This first call to "wait -n" _does_ notify the user in the case that the job terminated by exiting (not signalled), but _does not_ notify the user in the case that the job was killed.

Steve
Re: wait -n misses signaled subprocess
On 1/22/24 11:30 AM, Steven Pelley wrote:

> I've tried:
> killing with SIGTERM and SIGALRM
> killing from the test script, a subshell, and another terminal. I
> don't believe this is related to kill being a builtin.
> enabling job control (set -m)
> bash versions 4.4.12, 5.2.15, 5.2.21. All linux arm64

You must have left `set -m' enabled in the version whose results you posted, since you don't get non-interactive status notifications unless you do.

Let's see if we can go through what happens. Part of it has to do with notifications and when the shell removes jobs from the jobs table.

When the shell is interactive, and job control is enabled, it checks for terminated background jobs, notifies the user about their status if appropriate, and removes them from the jobs list -- bash removes a job from the list when it's notified the user of its status -- when it goes to read a new command, before printing the prompt. In a non-interactive shell, it obviously doesn't print a prompt, but it does the same thing, even the notification, before reading the next command.

When job control isn't enabled (usually in a non-interactive shell), the shell doesn't notify users about terminated background jobs, but it still removes dead jobs from the jobs list before reading the next command. It cleans the jobs table of notified jobs at other times, too, to move dead jobs out of the jobs list and keep it a manageable size.

The shell does keep a table of terminated background jobs that have been removed from the jobs list, because POSIX says you have to keep track of the last CHILD_MAX pids and make their exit statuses available to `wait' (but see below).

> Test script:
> # change to test other signals
> sig=TERM
>
> echo "TEST: KILL PRIOR TO wait -n @${SECONDS}"
> { sleep 1; exit 1; } &
> pid=$!

This ends up adding this to the jobs table as job 1. $pid is the pgrp leader.

> echo "kill -$sig $pid @${SECONDS}"
> kill -$sig $pid

You kill that job, it terminates, the shell gets the SIGCHLD and waits for it, marks it as dead in the jobs table, and goes to read the next command. It doesn't matter whether this happens before the sleep or the wait; the job gets removed as soon as the user is notified and moved to the table of saved statuses. (If the shell isn't doing notifications, the job just gets moved.)

> sleep 2
> wait -n $pid

When I run this, whether job control is enabled or not, I get an error message about an unknown job, because `wait -n' doesn't look in the table of saved statuses -- its job is to wait for `new' jobs to terminate, not ones that have already been removed from the table. Maybe you're redirecting stderr.

> echo "wait -n $pid return code $? @${SECONDS} (BUG)"

The job isn't in the jobs table because you've already been notified about it and it's not `new', you get the unknown job error status.

> wait $pid
> echo "wait $pid return code $? @${SECONDS}"

This works, because wait without -n looks in the table of saved statuses.

> echo "TEST: KILL DURING wait -n @${SECONDS}"
> { sleep 2; exit 1; } &
> pid=$!
> { sleep 1; echo "kill -$sig $pid @${SECONDS}"; kill -$sig $pid; } &
> wait -n $pid

The shell doesn't get the SIGCHLD before running wait, so the job is still in the jobs list.

> echo "wait -n $pid return code $? @${SECONDS}"
> wait $pid
> echo "wait $pid return code $? @${SECONDS}"

And you get the same status here. Even though the `wait -n' removes the job from the jobs list, the subsequent `wait' can still find it in the table of saved exit statuses.
> For which I get the following example output:
> TEST: KILL PRIOR TO wait -n @0
> kill -TERM 1384 @0
> ./test.sh: line 14: 1384 Terminated { sleep 1; exit 1; }
> wait -n 1384 return code 127 @2 (BUG)
> wait 1384 return code 143 @2
> TEST: KILL DURING wait -n @2
> kill -TERM 1402 @3
> ./test.sh: line 25: 1402 Terminated { sleep 2; exit 1; }
> wait -n 1402 return code 143 @3
> wait 1402 return code 143 @3
>
> I expect the line ending (BUG) to indicate a return code of 143.

It might, if `wait -n' looked for already-notified jobs in the table of saved exit statuses, but it doesn't. Should it, even if the user has already been notified of the status of that job?

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: wait -n misses signaled subprocess
On Mon, Jan 22, 2024 at 8:13 PM Steven Pelley wrote:
>
> Hello,
> I've encountered what I believe is a bug in bash's "wait -n". wait -n
> fails to return for processes that terminate due to a signal prior to
> calling wait -n. Instead, it returns 127 with an error that the
> process id cannot be found. Calling wait (without -n) then
> returns its exit code (e.g., 143). I expect wait -n to return each
> process through successive calls to wait -n, which is the case for
> processes that terminate in other manners even prior to calling wait
> -n.

I agree that this is a bug in bash. jobs.c/wait_for_any_job() marks all dead jobs as notified after reporting the status of the first one and misses the rest. With the following change (not a real fix, just for demonstration), devel branch behaves as expected:

diff --git a/jobs.c b/jobs.c
index 3e68bf24..d7c8d11b 100644
--- a/jobs.c
+++ b/jobs.c
@@ -3257,7 +3257,7 @@ wait_for_any_job (int flags, struct procstat *ps)
 	{
 	  if ((flags & JWAIT_WAITING) && jobs[i] && IS_WAITING (i) == 0)
 	    continue;		/* if we don't want it, skip it */
-	  if (jobs[i] && DEADJOB (i) && IS_NOTIFIED (i) == 0 && IS_FOREGROUND (i) == 0)
+	  if (jobs[i] && DEADJOB (i) && IS_FOREGROUND (i) == 0)
 	    {
 return_job:
 	      r = job_exit_status (i);
Re: wait -n misses signaled subprocess
Apologies for a quick double post, strace is fairly straightforward and confirms that bash is properly reaping the killed processes. This isn't a matter of the wait syscall failing to return the signaled child process.

Running the test from my original post and producing:

TEST: KILL PRIOR TO wait -n @0
kill -TERM 6941 @0
./test.sh: line 13: wait: 6941: no such job
wait -n 6941 return code 127 @2 (BUG)
wait 6941 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 6970 @3
wait -n 6970 return code 143 @3
wait 6970 return code 143 @3

shows:

kill(6941, SIGTERM) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6941, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 6941
wait4(-1, 0xc62b6d50, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

and

wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 6970
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0xd98a21a4, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=6970, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 6972
wait4(-1, 0xc62b6860, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})

The process signaled prior to wait -n (pid 6941) is awaited (wait4) in the SIGCHLD signal handler, which determines that it was signaled and terminated due to SIGTERM. The process signaled during wait -n (pid 6970) is awaited by a blocking call to wait4, before the SIGCHLD signal indicating it was killed is delivered; that call also reports that it was signaled and terminated due to SIGTERM. The only difference I see here is whether the subprocess is awaited by the blocking call rather than the nonblocking call inside the SIGCHLD handler.
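For readers comparing the strace view (`WTERMSIG(s) == SIGTERM`) with the shell view (return code 143), here is a small sketch of how the two relate, assuming bash's convention of reporting a signal death as 128 plus the signal number. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Status as reported by `wait` in the output above.
status=143

if [ "$status" -gt 128 ]; then
  # Statuses above 128 mean "killed by signal (status - 128)".
  signum=$((status - 128))
  # `kill -l` accepts such an exit status and prints the signal name.
  signame=$(kill -l "$status")
  echo "terminated by signal $signum (SIG$signame)"
else
  echo "exited with status $status"
fi
```

The 127 seen for the (BUG) line is not a signal status at all; it is the shell's "no such job" error code.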
For what it's worth I see subprocesses that terminate without signal also showing up in wait4 calls outside the SIGCHLD handler but this could easily be a matter of chance timing and a red herring.

Steve

On Wed, Jan 24, 2024 at 12:40 PM Steven Pelley wrote:
>
> > In the first case, if the subprocess N has terminated, its report is still queued and "wait" retrieves it. In the second case, if the subprocess N has terminated, it doesn't exist and as the manual page says "If id specifies a non-existent process or job, the return status is 127."
> >
> > What you're pointing out is that that creates a race condition when the subprocess ends before the "wait". And it seems that the kernel has enough information to tell "wait -n N", "process N doesn't exist, but you do have a queued termination report for it". But it's not clear that there's a way to ask the kernel for that information without reading all the queued termination reports (and losing the ability to return them for other "wait" calls).
>
> Thanks for the response, but I don't believe this is correct.
>
> Your understanding of the wait syscall is correct except that the exit code and process information always remains available until the process is awaited by its parent -- it is the wait syscall that itself reaps the process and makes it unavailable to later searches by pid. There is a possibility that the parent (bash in this case) might reap the process in multiple ways (i.e., from different threads, setting the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that race with each other, but the parent is always given an opportunity to read the exit code and reap the process if not disabled with SIGCHLD handler configuration.
>
> My understanding of bash is that it internally maintains a queue/list of finished child jobs to return such that wait -n mimics aspects of the wait syscall. The discussion at https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html supports that bash "silently" reaps child processes and decouples the wait syscall from the wait command.
>
> I assume it's possible to confirm that bash is awaiting the process and retrieving the exit code via ptrace/strace but I'm unfamiliar with these tools or bash logs.
>
> The test below allows the subprocess to complete normally, without being signaled, and then successfully retrieves its exit code via wait -n. This subprocess terminates before the call to wait -n. I see no documented reason that a process terminating without signal prior to wait -n should be returned while a process terminating with signal prior to wait -n should not.
>
> echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
> { sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
> pid=$!
> echo "child proc $pid @${SECONDS}"
>
> sleep 2
> wait -n $pid
> echo "wait -n $pid return code $? @${SECONDS}"
>
> For which I get output:
> TEST:
Re: wait -n misses signaled subprocess
> In the first case, if the subprocess N has terminated, its report is
> still queued and "wait" retrieves it. In the second case, if the
> subprocess N has terminated, it doesn't exist and as the manual page
> says "If id specifies a non-existent process or job, the return status
> is 127."
>
> What you're pointing out is that that creates a race condition when the
> subprocess ends before the "wait". And it seems that the kernel has
> enough information to tell "wait -n N", "process N doesn't exist, but
> you do have a queued termination report for it". But it's not clear
> that there's a way to ask the kernel for that information without
> reading all the queued termination reports (and losing the ability to
> return them for other "wait" calls).

Thanks for the response, but I don't believe this is correct.

Your understanding of the wait syscall is correct except that the exit code and process information always remains available until the process is awaited by its parent -- it is the wait syscall that itself reaps the process and makes it unavailable to later searches by pid. There is a possibility that the parent (bash in this case) might reap the process in multiple ways (i.e., from different threads, setting the SIGCHLD disposition to SIG_IGN, setting flag SA_NOCLDWAIT for the SIGCHLD handler -- the last 2 from NOTES of man waitpid on linux) that race with each other, but the parent is always given an opportunity to read the exit code and reap the process if not disabled with SIGCHLD handler configuration.

My understanding of bash is that it internally maintains a queue/list of finished child jobs to return such that wait -n mimics aspects of the wait syscall. The discussion at https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html supports that bash "silently" reaps child processes and decouples the wait syscall from the wait command.

I assume it's possible to confirm that bash is awaiting the process and retrieving the exit code via ptrace/strace but I'm unfamiliar with these tools or bash logs.

The test below allows the subprocess to complete normally, without being signaled, and then successfully retrieves its exit code via wait -n. This subprocess terminates before the call to wait -n. I see no documented reason that a process terminating without signal prior to wait -n should be returned while a process terminating with signal prior to wait -n should not.

echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"

sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"

For which I get output:

TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2270 @0
child finishing @1
wait -n 2270 return code 1 @2

Steve
Re: wait -n misses signaled subprocess
Steven Pelley writes: > wait -n > fails to return for processes that terminate due to a signal prior to > calling wait -n. Instead, it returns 127 with an error that the > process id cannot be found. Calling wait (without -n) then > returns its exit code (e.g., 143). My understanding is that this is how "wait" is expected to work, or at least known to work, but mostly because that's how the *kernel* works. "wait" without -n makes a system call which means "give me information about a terminated subprocess". The termination (or perhaps change-of-state) reports from subprocesses are queued up in the kernel until the process retrieves them through "wait" system calls. OTOH, "wait" with -n makes a system call which means "give me information about my subprocess N". In the first case, if the subprocess N has terminated, its report is still queued and "wait" retrieves it. In the second case, if the subprocess N has terminated, it doesn't exist and as the manual page says "If id specifies a non-existent process or job, the return status is 127." What you're pointing out is that that creates a race condition when the subprocess ends before the "wait". And it seems that the kernel has enough information to tell "wait -n N", "process N doesn't exist, but you do have a queued termination report for it". But it's not clear that there's a way to ask the kernel for that information without reading all the queued termination reports (and losing the ability to return them for other "wait" calls). Then again, I might be wrong. Dale
wait -n misses signaled subprocess
Hello,

I've encountered what I believe is a bug in bash's "wait -n". wait -n fails to return for processes that terminate due to a signal prior to calling wait -n. Instead, it returns 127 with an error that the process id cannot be found. Calling wait (without -n) then returns its exit code (e.g., 143). I expect wait -n to return each process through successive calls to wait -n, which is the case for processes that terminate in other manners even prior to calling wait -n. Killing a process while the wait -n is actively blocking works correctly. Test script at bottom.

The specific situation in which I encountered this is when trying to coordinate my own cooperative exit and handling/propagating SIGTERM. If I propagate this SIGTERM by killing multiple processes at once (kill pid1 pid2 pid3 ...) the next call to wait -n will return 143 and indicate a pid (via -p) but the next call to wait -n returns 127 as all processes previously terminated. If any of the awaited processes haven't yet terminated then you only discover the previously-killed process whenever the next terminates. I have workarounds/I'm not blocked but this seems a reasonable use case and worth sharing.

I've tried:
killing with SIGTERM and SIGALRM
killing from the test script, a subshell, and another terminal. I don't believe this is related to kill being a builtin.
enabling job control (set -m)
bash versions 4.4.12, 5.2.15, 5.2.21. All linux arm64

Test script:

# change to test other signals
sig=TERM

echo "TEST: KILL PRIOR TO wait -n @${SECONDS}"
{ sleep 1; exit 1; } &
pid=$!
echo "kill -$sig $pid @${SECONDS}"
kill -$sig $pid
sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS} (BUG)"
wait $pid
echo "wait $pid return code $? @${SECONDS}"

echo "TEST: KILL DURING wait -n @${SECONDS}"
{ sleep 2; exit 1; } &
pid=$!
{ sleep 1; echo "kill -$sig $pid @${SECONDS}"; kill -$sig $pid; } &
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"
wait $pid
echo "wait $pid return code $? @${SECONDS}"

For which I get the following example output:

TEST: KILL PRIOR TO wait -n @0
kill -TERM 1384 @0
./test.sh: line 14: 1384 Terminated { sleep 1; exit 1; }
wait -n 1384 return code 127 @2 (BUG)
wait 1384 return code 143 @2
TEST: KILL DURING wait -n @2
kill -TERM 1402 @3
./test.sh: line 25: 1402 Terminated { sleep 2; exit 1; }
wait -n 1402 return code 143 @3
wait 1402 return code 143 @3

I expect the line ending (BUG) to indicate a return code of 143.

Thanks,
Steve Pelley
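One possible workaround for the SIGTERM-propagation case described above is sketched here. It is an assumption on my part, not necessarily the workaround the reporter used: keep the pids yourself and collect each one with plain `wait PID`, which the thread reports does consult the saved statuses. The three `sleep 30` children are hypothetical stand-ins for real workers, and the expected 143 assumes Linux signal numbering.

```shell
#!/usr/bin/env bash
# Hypothetical workers; stand-ins for real long-running children.
pids=()
for _ in 1 2 3; do
  sleep 30 &
  pids+=("$!")
done

# Propagate the signal to every child at once.
kill -TERM "${pids[@]}"

# Collect each status by pid. Plain `wait PID` does not depend on the
# child still being in the jobs list, so already-dead children are fine.
statuses=()
for pid in "${pids[@]}"; do
  rc=0
  wait "$pid" || rc=$?
  statuses+=("$rc")
  echo "pid $pid finished with status $rc"
done
```

The trade-off is that statuses arrive in pid-list order rather than completion order, so this suits shutdown paths better than "react to whichever child finishes first" loops.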