Re: `wait -n` returns 127 when it shouldn't

2023-05-22 Thread Robert Elz
Date: Mon, 22 May 2023 02:43:18 -0300
From: Aleksey Covacevice



  | I fail to see where the race condition in `true & wait -n` is.

It wasn't in the script - but in the implementation inside bash.
Chet has (apparently, I don't look at bash sources, and only
run released versions & patches) fixed it already.

Any further discussion of this issue is just a waste of everyone's time
(unless you have obtained the fixed version and discover further problems,
naturally).

kre





Re: `wait -n` returns 127 when it shouldn't

2023-05-22 Thread alex xmb ratchev
On Mon, May 22, 2023, 07:43 Aleksey Covacevice 
wrote:

> On Thu, May 18, 2023 at 3:07 PM Chet Ramey  wrote:
> >
> > On 5/18/23 7:51 AM, Robert Elz wrote:
> >
> > > Apparently, in bash, if the code is running in a (shell) loop (like
> > > inside a while, or similar, loop) then each iteration around the loop,
> > > any jobs that have exited, but not been cleaned already, are removed
> > > from the queue (the jobs table in practice, though bash may also have
> > > something else).
> > >
> > > That's really broken, and should be fixed (but has apparently been that
> > > way for decades, and no-one noticed).
> >
> > This isn't a problem, and is a red herring. The code that manages that
> > list makes sure to keep as many jobs in the list as POSIX requires,
> > subject to the maxchild resource limit.
> >
> >
> > > In the script in question, the offending loop isn't the one in the main
> > > program - in that for each iteration the background processes are
> > > started, and waited for, in each iteration, but the one in the waitjobs
> > > function. which (appears at first glance, which is all the analysis
> > > shells ever do) to be an infinite loop, so each time around, if there
> > > are any completed jobs in the table, they're removed.
> >
> > No, this isn't what happens. The problem is that the shell reaps both
> > jobs, but the `wait -n' code had a race condition that prevented it from
> > finding a job in the list.
> >
>
> I fail to see where the race condition in `true & wait -n` is. Whether the
> 'underlying function' has a race condition is the true red herring here.
>

Maybe the race condition is with `&`, and the bash code not registering the
new job fast enough.

> Also, the manual states:
>
> 'If the -n option is supplied, wait waits for a single job from the
> list of ids or, if no ids are supplied, any job, to complete and returns
> its exit status.'
>
> `true & wait -n` returning 127 means `wait -n` did not wait for 'any job';
> in fact, it waited for no job. The subsequent part about 'unwaited-for'
> children is either irrelevant or contradictory to the above given the
> current scenario.
>
> > --
> > ``The lyf so short, the craft so long to lerne.'' - Chaucer
> >  ``Ars longa, vita brevis'' - Hippocrates
> > Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
> >
>
>


Re: `wait -n` returns 127 when it shouldn't

2023-05-21 Thread Aleksey Covacevice
On Thu, May 18, 2023 at 3:07 PM Chet Ramey  wrote:
>
> On 5/18/23 7:51 AM, Robert Elz wrote:
>
> > Apparently, in bash, if the code is running in a (shell) loop (like inside
> > a while, or similar, loop) then each iteration around the loop, any jobs
> > that have exited, but not been cleaned already, are removed from the queue
> > (the jobs table in practice, though bash may also have something else).
> >
> > That's really broken, and should be fixed (but has apparently been that
> > way for decades, and no-one noticed).
>
> This isn't a problem, and is a red herring. The code that manages that list
> makes sure to keep as many jobs in the list as POSIX requires, subject to
> the maxchild resource limit.
>
>
> > In the script in question, the offending loop isn't the one in the main
> > program - in that for each iteration the background processes are started,
> > and waited for, in each iteration, but the one in the waitjobs function.
> > which (appears at first glance, which is all the analysis shells ever do)
> > to be an infinite loop, so each time around, if there are any completed
> > jobs in the table, they're removed.
>
> No, this isn't what happens. The problem is that the shell reaps both jobs,
> but the `wait -n' code had a race condition that prevented it from finding
> a job in the list.
>

I fail to see where the race condition in `true & wait -n` is. Whether the
'underlying function' has a race condition is the true red herring here.

Also, the manual states:

'If the -n option is supplied, wait waits for a single job from the
list of ids or, if no ids are supplied, any job, to complete and returns
its exit status.'

`true & wait -n` returning 127 means `wait -n` did not wait for 'any job';
in fact, it waited for no job. The subsequent part about 'unwaited-for'
children is either irrelevant or contradictory to the above given the
current scenario.

> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
>  ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
>



Re: `wait -n` returns 127 when it shouldn't

2023-05-19 Thread Chet Ramey

On 5/19/23 6:24 AM, Robert Elz wrote:

> Date: Thu, 18 May 2023 14:07:32 -0400
> From: Chet Ramey
>
>   | This isn't a problem, and is a red herring. The code that manages that list
>   | makes sure to keep as many jobs in the list as POSIX requires, subject to
>   | the maxchild resource limit.
>
> That is good.
>
> Everyone else be aware that my previous response was based upon (rather
> incomplete) data from a message Chet sent me off list (because I sent him
> one off list, by accident really - the off list part).


I initially suspected the problem was with that code, and spent some time
chasing it before looking at the `wait -n' code itself and finding the
problem there.


--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: `wait -n` returns 127 when it shouldn't

2023-05-19 Thread Robert Elz
Date: Thu, 18 May 2023 14:07:32 -0400
From: Chet Ramey

  | This isn't a problem, and is a red herring. The code that manages that list
  | makes sure to keep as many jobs in the list as POSIX requires, subject to
  | the maxchild resource limit.

That is good.

Everyone else be aware that my previous response was based upon (rather
incomplete) data from a message Chet sent me off list (because I sent him
one off list, by accident really - the off list part).

Bugs getting fixed is always a good thing.

kre




Re: `wait -n` returns 127 when it shouldn't

2023-05-18 Thread Chet Ramey

On 5/18/23 7:51 AM, Robert Elz wrote:


> Apparently, in bash, if the code is running in a (shell) loop (like inside
> a while, or similar, loop) then each iteration around the loop, any jobs that
> have exited, but not been cleaned already, are removed from the queue (the
> jobs table in practice, though bash may also have something else).
>
> That's really broken, and should be fixed (but has apparently been that
> way for decades, and no-one noticed).

This isn't a problem, and is a red herring. The code that manages that list
makes sure to keep as many jobs in the list as POSIX requires, subject to
the maxchild resource limit.

> In the script in question, the offending loop isn't the one in the main
> program - in that for each iteration the background processes are started,
> and waited for, in each iteration, but the one in the waitjobs function.
> which (appears at first glance, which is all the analysis shells ever do)
> to be an infinite loop, so each time around, if there are any completed
> jobs in the table, they're removed.

No, this isn't what happens. The problem is that the shell reaps both jobs,
but the `wait -n' code had a race condition that prevented it from finding
a job in the list.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: `wait -n` returns 127 when it shouldn't

2023-05-18 Thread Chet Ramey

On 5/18/23 12:16 AM, Martin D Kealey wrote:

> If there is silent reaping going on (other than “wait -n” or “trap ...
> SIGCHLD”) shouldn't the exit status and pid of each silently reaped process
> be retained in a queue that “wait -n” can extract from, in order to
> maintain the reasonable expected semantics? Arguably this queue should be
> shared with “fg” when job control is enabled.


There is always `silent reaping' going on. This is a red herring.

The problem was a race condition in the `wait -n' code.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: `wait -n` returns 127 when it shouldn't

2023-05-18 Thread Robert Elz
Date: Thu, 18 May 2023 07:35:35 -0400
From: Greg Wooledge

  | I'm fairly sure most (or all?) shells do this, not just bash.

Interactive shells are "different" from those running a script in
this regard.

kre




Re: `wait -n` returns 127 when it shouldn't

2023-05-18 Thread Robert Elz
Date: Thu, 18 May 2023 14:16:17 +1000
From: Martin D Kealey


  | I know that some platforms (used to?) lack all of the “waitpid()”,

This is irrelevant to the issue at hand (and in general, for shells, is
irrelevant anyway, as shells usually clean up the process table as soon as
possible, always waiting for anything).   Lack of anything more than
simple wait() can be problematic, as that hangs, which isn't always desired,
but in combination with SIGCHLD (as abominable as that signal is defined to
work on some systems) can be made to function.

But not relevant here, the script is just doing wait -n (no specific pid
requested) and hence there's no need for anything fancy in terms of wait
sys calls.

  | If there is silent reaping going on (other than “wait -n” or “trap ...
  | SIGCHLD”)

In practice, there always is, in all shells.

  | shouldn't the exit status and pid of each silently reaped process
  | be retained in a queue that “wait -n” can extract from,

Yes, that is what is supposed to happen.   And does.   The question is
when jobs are removed from that queue.
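
A minimal sketch of that retention, assuming bash running a script
(non-interactive). The exit value 7 and the one-second pause are arbitrary
choices for illustration. The child has exited and been reaped well before
wait is called, yet asking for it by pid still yields its status:

```shell
#!/usr/bin/env bash
# Start a short-lived background child with a recognisable exit status.
(exit 7) &
pid=$!

# Give the child time to exit and be reaped by the shell in the background.
sleep 1

# The status was retained, so wait can still report it for this pid.
rc=0
wait "$pid" || rc=$?
echo "status of child $pid: $rc"
```

Waiting on a specific pid like this does not go through the `wait -n` code
path that the thread identifies as the source of the problem.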

  | Would you care to speculate more precisely on where such silent reaping may
  | occur, given the code as shown?

Apparently, in bash, if the code is running in a (shell) loop (like inside
a while, or similar, loop) then each iteration around the loop, any jobs that
have exited, but not been cleaned already, are removed from the queue (the
jobs table in practice, though bash may also have something else).

That's really broken, and should be fixed (but has apparently been that
way for decades, and no-one noticed).

The intent is to avoid the queue growing infinitely big in the case of
loops like

while :; do process& maybe other code but not doing wait; done

Note this does not need to be a very speedy loop, just one that runs
forever, and never cleans anything up.   That's broken, but in old shell
scripts, hard to avoid, as the only cleanup method was a simple "wait" which
would wait until all background processes completed, defeating the purpose.

In the script in question, the offending loop isn't the one in the main
program - in that for each iteration the background processes are started,
and waited for, in each iteration, but the one in the waitjobs function.
which (appears at first glance, which is all the analysis shells ever do)
to be an infinite loop, so each time around, if there are any completed
jobs in the table, they're removed.   Then, if nothing is still running,
wait -n returns 127, and we exit.   If we're lucky, we get to the wait -n
before the false job finishes, and wait -n collects that one (what happens
to the background true is completely irrelevant to this script), and
everything iterates.   If we're unlucky, false has already completed, and
its status is lost, before we get a chance to wait for it.

Simply broken.

What bash should be doing is limiting the number of jobs that can be in
the jobs table (to perhaps a few hundred) - deleting the oldest completed
ones if more jobs need to be added.   That's allowed, solves the infinite
new job problem, and allows sane programs that do wait for their children
to avoid this kind of issue.

  | PS: I'm not convinced that “trap ... SIGCHLD” needs to be in that list;

No, shell level SIGCHLD traps are irrelevant.   The semantics of SIGCHLD
mean that they can't rationally be mapped directly from SIGCHLD signals,
those things are hopeless and need to be handled specially by the shell
(or always kept at SIG_DFL so they never occur) or things fail badly.

kre




Re: `wait -n` returns 127 when it shouldn't

2023-05-18 Thread Greg Wooledge
On Thu, May 18, 2023 at 02:16:17PM +1000, Martin D Kealey wrote:
> If there is silent reaping going on (other than “wait -n” or “trap ...
> SIGCHLD”) [...]

Yes, bash silently reaps child processes.

unicorn:~$ tty
/dev/pts/2
unicorn:~$ sleep 5 & sleep 7 &
[1] 942813
[2] 942814


unicorn:~$ tty
/dev/pts/0
unicorn:~$ ps -ft pts/2
UID          PID    PPID  C STIME TTY          TIME CMD
greg         973     959  0 Apr29 pts/2    00:00:00 bash
greg      942813     973  0 07:29 pts/2    00:00:00 sleep 5
greg      942814     973  0 07:29 pts/2    00:00:00 sleep 7
unicorn:~$ ps -ft pts/2
UID          PID    PPID  C STIME TTY          TIME CMD
greg         973     959  0 Apr29 pts/2    00:00:00 bash


I didn't touch pts/2 at all during this time.  I just ran the ps commands
on pts/0.  As you can see, the two sleep processes are just *gone*.  They
are not hanging around as zombies waiting for me to do something on
pts/2.  At no point did I ever call "wait" explicitly.

I'm fairly sure most (or all?) shells do this, not just bash.



Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Martin D Kealey
On Thu, 18 May 2023 at 02:13, Chet Ramey  wrote:

> It's possible for the shell to reap both background jobs before `wait -n'
> is called. The underlying function returns < 0 when there aren't any
> unwaited-for jobs, which the wait builtin translates to 127.
>

I know that some platforms (used to?) lack all of the “waitpid()”,
“wait3()”, “wait4()”, and “waitid()” syscalls. On those you need to use
“wait()” repeatedly until you get the PID the script asked for, and keep
track of the others until the script asks for them too. At least, this is
what Perl and MSys did when running on older Windows.

However Linux has all 5 reaping syscalls available, and can provide the
exit status to a signal handler (in the siginfo parameter) without calling
any of them, and therefore without *actually* reaping the process.

If there is silent reaping going on (other than “wait -n” or “trap ...
SIGCHLD”) shouldn't the exit status and pid of each silently reaped process
be retained in a queue that “wait -n” can extract from, in order to
maintain the reasonable expected semantics? Arguably this queue should be
shared with “fg” when job control is enabled.

Would you care to speculate more precisely on where such silent reaping may
occur, given the code as shown?

-Martin

PS: I'm not convinced that “trap ... SIGCHLD” needs to be in that list;
it's the “wait” inside the trap that actually matters, and if you *don't*
“wait” inside a SIGCHLD trap, things are going to get quite strange anyway.


Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Chet Ramey

On 5/16/23 1:35 PM, Aleksey Covacevice wrote:


> Bash Version: 5.1
> Patch Level: 16
> Release Status: release
>
> Description:
> `wait -n` sometimes returns with status code 127 even though there are
> unwaited-for children.

There are not. That's why `wait -n' returns 127.

> Repeat-By:
> The following script does finish after a while:
>
> waitjobs() {
>     local status=0
>     while true; do
>         local code=0; wait -n || code=$?
>         ((code == 127)) && break
>         ((!code)) || status=$code
>     done
>     return $status
> }
>
> # Eventually finishes:
> while true; do (
>     true &
>     false &
>     waitjobs
> ) && break; done


It's possible for the shell to reap both background jobs before `wait -n'
is called. The underlying function returns < 0 when there aren't any
unwaited-for jobs, which the wait builtin translates to 127.
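
The 127 case can be seen in isolation with a minimal sketch, assuming a
bash 5.x script that has started no background jobs at all (so the timing
question discussed in this thread does not arise):

```shell
#!/usr/bin/env bash
# No background jobs have been started in this shell, so there is
# nothing for `wait -n` to wait for.
rc=0
wait -n || rc=$?
echo "wait -n with no children returned $rc"
```

This only shows what 127 means when there really are no children; it does
not reproduce the reported behaviour.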

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Phi Debian
On Wed, May 17, 2023 at 12:21 PM Oğuz İsmail Uysal <
oguzismailuy...@gmail.com> wrote:

>
> This boils down to the following
>
>  true &
>  false &
>  wait -n
>
> There is no guarantee that `wait -n' will report the status of `true',
> the shell may acquire the status of `false' first. It's not a bug
>

OK for the randomness of the result, yet $? should be 0 or 1, never 127, as
the OP said. Did I miss something?


Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Robert Elz
Date: Wed, 17 May 2023 17:23:21 +1000
From: Martin D Kealey



  | I suspect putting "local" in a loop is doing something strange.

"local" is an executable statement, not a declaration (shell really
has none of the latter) - every time it is executed it creates a new
local variable (which remains until the function exits, there are no
local scope rules in shell either).

That should make no difference to this code though, and the difference
you report likely hints at the source of the problem.
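
A small sketch of those semantics, assuming bash; the function name `demo`
and the variable name `seen` are made up for illustration. The `local`
inside the loop runs on every iteration, and the variable stays visible for
the rest of the function rather than vanishing when the loop ends:

```shell
#!/usr/bin/env bash
demo() {
    local i
    for i in 1 2 3; do
        local seen=$i    # executed on each iteration, not a one-time declaration
    done
    # No block scope: `seen` is still set after the loop has finished.
    echo "inside: seen=$seen"
}

seen=global
demo
# The caller's variable is untouched once the function has returned.
echo "outside: seen=$seen"
```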

The code is written weirdly however, this sequence

code=0; wait -n || code=$?

could just be

wait -n; code=$?

(the "local" that might be there makes no difference, or
shouldn't, to the execution semantics).
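
A sketch of why the two spellings capture the same value, using a
hypothetical function `fails` as a stand-in for `wait -n` so that the result
does not depend on job timing:

```shell
#!/usr/bin/env bash
fails() { return 3; }    # hypothetical stand-in for `wait -n`

# Form used in the original script.
code=0; fails || code=$?
a=$code

# Simpler form: read $? directly after the command.
fails; code=$?
b=$code

echo "first form: $a, second form: $b"
```

One difference worth keeping in mind: under `set -e` the second form would
make the shell exit at the failing command before `code` is assigned, which
the `||` form avoids.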

Getting status==127 out of the waitjobs function should be impossible,
as it starts out being 0, and is only changed to $code if $code!=127
so if that ever happens, there looks to be a bug somewhere.

oguzismailuy...@gmail.com said:
  | There is no guarantee that `wait -n' will report the status of `true',  the
  | shell may acquire the status of `false' first.

That should be irrelevant, waitjobs() has a loop that explicitly waits
upon wait -n returning 127 (which it does not return to the caller, or
should not) which should mean that there are no children remaining.

Further, as long as waitjobs wait -n call actually reaps the exit from
false, it should always return with status==1 (the exit status from false).
Since false & true should both always be running in the bg when waitjobs
is called, the exit status from false should always (fairly quickly, since
it doesn't run for very long) be obtained, causing code==1 and hence status==1
(after which status will never be altered again as it isn't touched if
code==0 or code==127 which should be the only other 2 returns from wait -n).

I modified the script to get rid of the (()) usage and replace that with
the similar [ ] code which made no difference at all when executed under
bash, it still ends the outer loop, reasonably quickly.

But then I could run the script using the NetBSD shell, where it (seems to)
run forever (ie: it is still running - but forever hasn't been reached yet).

I think there is a bug, probably some race condition in bash with the jobs
table, causing the "false" job to get missed sometimes when running this code.
That allows status to remain 0, and the outer loop to break, and the script
to terminate.

Most likely the use of "local" in the loop which caused the difference that
Martin noticed alters the timing somewhat to affect the race results.

kre




Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Oğuz İsmail Uysal

On 5/17/23 3:27 PM, Martin D Kealey wrote:
> On Wed, 17 May 2023 at 20:20, Oğuz İsmail Uysal wrote:
> > On 5/16/23 8:35 PM, Aleksey Covacevice wrote:
> > > [original code elided as it's been mangled by line-wrapping]
> >
> > This boils down to the following
> >
> >     true &
> >     false &
> >     wait -n
>
> With respect, I disagree with that statement of equivalence.
>
> The only way for the loop to terminate is when `wait` returns 127,
> after both children have been reaped.
> By when the non-zero exit status of "false" will have been noted, and
> then used as the return value of the function.

Must have misread then, thanks



Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Oğuz İsmail Uysal

On 5/16/23 8:35 PM, Aleksey Covacevice wrote:
> waitjobs() {
>     local status=0
>     while true; do
>         local code=0; wait -n || code=$?
>         ((code == 127)) && break
>         ((!code)) || status=$code
>     done
>     return $status
> }
>
> # Eventually finishes:
> while true; do (
>     true &
>     false &
>     waitjobs
> ) && break; done

This boils down to the following

    true &
    false &
    wait -n

There is no guarantee that `wait -n' will report the status of `true', 
the shell may acquire the status of `false' first. It's not a bug.




Re: `wait -n` returns 127 when it shouldn't

2023-05-17 Thread Martin D Kealey
On Wed, 17 May 2023 at 03:35, Aleksey Covacevice <
aleksey.covacev...@gmail.com> wrote:

> Description:
> `wait -n` sometimes returns with status code 127 even though there are
> unwaited-for children.
>
> Repeat-By:
> The following script does finish after a while:
>
> waitjobs() {
>     local status=0
>     while true; do
>         local code=0; wait -n || code=$?
>

I put "local code" out of the loop and the problem went away (or at least
became far less likely).
I suspect putting "local" in a loop is doing something strange.




>         ((code == 127)) && break
>         ((!code)) || status=$code
>     done
>     return $status
> }
>
> # Eventually finishes:
> while true; do (
>     true &
>     false &
>     waitjobs
> ) && break; done
>

I'm testing with Bash 5.1.4p47

-Martin


`wait -n` returns 127 when it shouldn't

2023-05-16 Thread Aleksey Covacevice
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -march=x86-64 -mtune=generic -O2 -pipe -fno-plt
-fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat
-Werror=format-security -fstack-clash-protection
-fcf-protection -g
-ffile-prefix-map=/build/bash/src=/usr/src/debug/bash -flto=auto
-DDEFAULT_PATH_VALUE='/usr/local/sbin:/usr/local/bin:/usr/bin'
-DSTANDARD_UTILS_PATH='/usr/bin' -DSYS_BASHRC='/etc/bash.bashrc'
-DSYS_BASH_LOGOUT='/etc/bash.bash_logout'
-DNON_INTERACTIVE_LOGIN_SHELLS
uname output: Linux work 6.0.7-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 03
Nov 2022 18:01:58 + x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 5.1
Patch Level: 16
Release Status: release

Description:
`wait -n` sometimes returns with status code 127 even though there are
unwaited-for children.

Repeat-By:
The following script does finish after a while:

waitjobs() {
    local status=0
    while true; do
        local code=0; wait -n || code=$?
        ((code == 127)) && break
        ((!code)) || status=$code
    done
    return $status
}

# Eventually finishes:
while true; do (
    true &
    false &
    waitjobs
) && break; done
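
For comparison, a sketch of one way to collect every child's status without
relying on `wait -n`, by recording each pid and waiting on it explicitly.
The function name `waitpids` and the `pids` array are hypothetical, and this
is an alternative approach, not the fix that went into bash:

```shell
#!/usr/bin/env bash
# Wait for each pid given as an argument; return the last non-zero
# status seen, or 0 if every child succeeded.
waitpids() {
    local status=0 code pid
    for pid in "$@"; do
        code=0
        wait "$pid" || code=$?
        ((code == 0)) || status=$code
    done
    return "$status"
}

pids=()
true &  pids+=("$!")
false & pids+=("$!")

rc=0
waitpids "${pids[@]}" || rc=$?
echo "overall status: $rc"
```

Because each child is named explicitly, its status is found even when it
finished before the wait was issued.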