On Sat, Aug 24, 2013 at 7:09 PM, Cedric Blancher
<[email protected]> wrote:
> Are there any known issues where a SIGSTOP can trigger multiple
> SIGCHLD trap calls with code=STOPPED for the same event?
>
> We've been experiencing trouble with this kind of problem, i.e. lack of
> SIGCHLD state change reports when a child changes from stopped to
> running or from running to stopped, on a massive scale if the number of
> children exceeds a few hundred processes or if the parent process is
> stalled by paging/swapping.
>
> I can't reproduce it with a simple testcase but while searching I once
> had this failure:
> ksh -x -c 'builtin pids ; integer numsigchld=0 ; trap "print -v
> .sh.sig;((numsigchld++))" CHLD ; { while true ; do kill -s STOP $(pids
> -f "%(pid)d") ; done } & pid=$! ; sleep 1 ; kill -CONT $pid ;
> /usr/bin/sleep 1; kill -KILL $pid ; wait $pid ; print
> "$?,${numsigchld}"'
> + builtin pids
> + numsigchld=0
> + typeset -li numsigchld
> + trap 'print -v .sh.sig;((numsigchld++))' CHLD
> + pid=26972
> + sleep 1
> + true
> + pids -f '%(pid)d'
> + kill -s STOP 26972
> + print -v .sh.sig
> (
>         typeset -r -l -i 16 addr=16#3e80000695c
>         typeset -r -l -i band=0
>         typeset -r code=STOPPED
>         typeset -r -i errno=0
>         typeset -r name=CHLD
>         typeset -r -i pid=26972
>         typeset -r -i signo=17
>         typeset -r -i status=19
>         typeset -r -i uid=231713
>         value=(
>                 typeset -r -i int=19
>                 typeset -r -l -i 16 ptr=16#13
>         )
> )
> + ((numsigchld++))
> + kill -CONT 26972
> + true
> + pids -f '%(pid)d'
> + kill -s STOP 26972
> + print -v .sh.sig
> (
>         typeset -r -l -i 16 addr=16#3e80000695c
>         typeset -r -l -i band=0
>         typeset -r code=CONTINUED
>         typeset -r -i errno=0
>         typeset -r name=CHLD
>         typeset -r -i pid=26972
>         typeset -r -i signo=17
>         typeset -r -i status=0
>         typeset -r -i uid=231713
>         value=(
>                 typeset -r -i int=0
>                 typeset -r -l -i 16 ptr=16#0
>         )
> )
> + ((numsigchld++))
> + /usr/bin/sleep 1
> ./arch/linux.i386-64/bin/ksh: 26972: Stopped (SIGSTOP)
> + print -v .sh.sig
> (
>         typeset -r -l -i 16 addr=16#3e80000695c
>         typeset -r -l -i band=0
>         typeset -r code=STOPPED
>         typeset -r -i errno=0
>         typeset -r name=CHLD
>         typeset -r -i pid=26972
>         typeset -r -i signo=17
>         typeset -r -i status=19
>         typeset -r -i uid=231713
>         value=(
>                 typeset -r -i int=19
>                 typeset -r -l -i 16 ptr=16#13
>         )
> )
> + ((numsigchld++))
> + print -v .sh.sig
> (
>         typeset -r -l -i 16 addr=16#3e80000695c
>         typeset -r -l -i band=0
>         typeset -r code=STOPPED
>         typeset -r -i errno=0
>         typeset -r name=CHLD
>         typeset -r -i pid=26972
>         typeset -r -i signo=17
>         typeset -r -i status=19
>         typeset -r -i uid=231713
>         value=(
>                 typeset -r -i int=19
>                 typeset -r -l -i 16 ptr=16#13
>         )
> )
> + ((numsigchld++))
> + kill -KILL 26972
> + print -v .sh.sig
> (
>         typeset -r -l -i 16 addr=16#3e80000695c
>         typeset -r -l -i band=0
>         typeset -r code=KILLED
>         typeset -r -i errno=0
>         typeset -r name=CHLD
>         typeset -r -i pid=26972
>         typeset -r -i signo=17
>         typeset -r -i status=9
>         typeset -r -i uid=231713
>         value=(
>                 typeset -r -i int=9
>                 typeset -r -l -i 16 ptr=16#9
>         )
> )
> + ((numsigchld++))
> + wait 26972
> + print 265,5
> 265,5
>
> SIGCHLD trap was called twice for the STOP signal and the total count
> of signals is 5 (numsigchld=5) instead of 4.

We can reproduce the problem, and I think I can explain it: ksh93
doesn't use the siginfo data created by the kernel to manage job
control. Instead it polls all jobs for state changes and generates
artificial siginfo structures to use for .sh.sig. Of course that works
only if the process doing the job management is fast enough and the
managed children don't change state faster than the parent process can
poll.
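
To make the polling approach concrete, here is a rough sketch in C. It
is not ksh93's actual code; job_event, poll_job() and demo_poll() are
names made up for this example. The parent asks waitpid() for a job's
state and builds the event record itself, so it only learns what
waitpid() reports at the moment it asks.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical record, loosely modelled on the fields printed by .sh.sig. */
struct job_event {
    pid_t pid;
    int   code;    /* CLD_EXITED, CLD_KILLED, CLD_STOPPED or CLD_CONTINUED */
    int   status;  /* exit code or signal number */
};

/*
 * Poll one child without blocking and synthesize an event from its wait
 * status. Returns 1 if an event was produced, 0 if nothing changed and
 * -1 on error. Core dumps are not distinguished in this sketch.
 */
int poll_job(pid_t pid, struct job_event *ev)
{
    int st;
    pid_t r = waitpid(pid, &st, WNOHANG | WUNTRACED | WCONTINUED);

    if (r <= 0)
        return (int)r;
    ev->pid = r;
    if (WIFSTOPPED(st)) {
        ev->code = CLD_STOPPED;
        ev->status = WSTOPSIG(st);
    } else if (WIFCONTINUED(st)) {
        ev->code = CLD_CONTINUED;
        ev->status = 0;
    } else if (WIFSIGNALED(st)) {
        ev->code = CLD_KILLED;
        ev->status = WTERMSIG(st);
    } else {
        ev->code = CLD_EXITED;
        ev->status = WEXITSTATUS(st);
    }
    return 1;
}

/*
 * Demo: stop, continue and kill one child, polling after each step.
 * Fills codes[0..2] with the synthesized codes; returns 0 on success.
 */
int demo_poll(int codes[3])
{
    static const int sigs[3] = { SIGSTOP, SIGCONT, SIGKILL };
    struct timespec nap = { 0, 1000000 };   /* 1 ms between polls */
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                        /* child just waits for signals */
    }
    for (int i = 0; i < 3; i++) {
        struct job_event ev;
        int r;

        if (kill(pid, sigs[i]) < 0)
            return -1;
        while ((r = poll_job(pid, &ev)) == 0)
            nanosleep(&nap, NULL);
        if (r < 0)
            return -1;
        codes[i] = ev.code;
    }
    return 0;
}
```

The demo waits for each report before sending the next signal, which
is exactly the condition that breaks down when the parent is slow or
has hundreds of jobs to walk through.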

IMO a fix would be to use the SIGCHLD siginfo data created by the
kernel for job management: the kernel creates the SIGCHLD siginfo
data, it gets queued in the sh_fault() trap handler like other siginfo
data, and it is then processed outside the trap handler to update the
internal jobs structure and call the .sh.sig machinery (if trap "..."
CHLD is active), after which the data is disposed of.
This would solve your problem and also solve the performance problem
which occurs if you poll thousands of child processes for state
changes.
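
A minimal sketch of that idea in C, under stated assumptions: this is
not a patch against ksh93, on_sigchld() merely stands in for the role
sh_fault() would play, and the queue size and all function names are
invented for illustration. The handler only copies the kernel's
siginfo_t into a ring buffer; everything else happens outside it.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define QSIZE 64                      /* hypothetical queue capacity */

static siginfo_t queue[QSIZE];
static volatile sig_atomic_t q_head, q_tail;

/* Handler: copy the kernel-supplied siginfo into the queue, nothing more. */
static void on_sigchld(int signo, siginfo_t *si, void *ctx)
{
    int next = (q_head + 1) % QSIZE;

    (void)signo;
    (void)ctx;
    if (next != q_tail) {             /* the event is dropped if the queue is full */
        queue[q_head] = *si;
        q_head = next;
    }
}

/* SA_NOCLDSTOP is deliberately not set: stop/continue reports are wanted. */
int install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigchld;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Dequeue one event outside the handler; returns 1 if *out was filled. */
int next_event(siginfo_t *out)
{
    sigset_t block, old;
    int got = 0;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);   /* keep the handler out meanwhile */
    if (q_tail != q_head) {
        *out = queue[q_tail];
        q_tail = (q_tail + 1) % QSIZE;
        got = 1;
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return got;
}

/*
 * Demo: stop, continue and kill one child, taking each state change from
 * the queued siginfo. Fills codes[0..2] with si_code; returns 0 on success.
 */
int demo_queue(int codes[3])
{
    static const int sigs[3] = { SIGSTOP, SIGCONT, SIGKILL };
    struct timespec nap = { 0, 1000000 };   /* 1 ms */
    pid_t pid;

    if (install_handler() < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();
    }
    for (int i = 0; i < 3; i++) {
        siginfo_t si;

        if (kill(pid, sigs[i]) < 0)
            return -1;
        while (!next_event(&si))
            nanosleep(&nap, NULL);
        codes[i] = si.si_code;        /* this is where a jobs table would be updated */
    }
    return waitpid(pid, NULL, 0) == pid ? 0 : -1;   /* reap the killed child */
}
```

One caveat: SIGCHLD is a standard signal, so several instances pending
at the same time may be merged by the kernel. A real implementation
would probably still have to reconcile the queue with waitpid() or
waitid() now and then.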

Irek
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
