Hi,
We are facing an issue with ksh: a script hangs, consuming 100% CPU time.
The issue cannot be reproduced at will. It usually takes about 3 weeks for
the problem to occur, and it is more likely when the system is under heavy
load. The backtrace shows the following:
Core was generated by `/bin/ksh'.
#0 0x0000000000414359 in job_chksave (pid=0)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
1653 if(jp->pid==pid)
(gdb) bt
#0 0x0000000000414359 in job_chksave (pid=0)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
#1 0x000000000041451c in job_unpost (pwtop=<value optimized out>,
notify=<value optimized out>)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
#2 0x0000000000415a97 in job_wait (pid=0)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1392
#3 0x0000000000434eef in sh_exec (t=0xacadc00, flags=4)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1087
#4 0x00000000004368ea in sh_exec (t=0xacadd20, flags=<value optimized out>)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1218
#5 0x0000000000435f2c in sh_exec (t=0xad1f5a0, flags=4)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1313
#6 0x0000000000435dd9 in sh_exec (t=0xad1f5a0, flags=181532064)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1596
#7 0x0000000000435db3 in sh_exec (t=0xad19060, flags=4)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1332
#8 0x00000000004350ec in sh_exec (t=0xacad410, flags=36)
at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1531
#9 0x00000000004076b3 in exfile ()
#10 0x0000000000406bbc in sh_main ()
#11 0x000000363281d8a4 in __libc_start_main (main=0x405f80 <main>, argc=2,
ubp_av=0x7fff6eca28a8, init=<value optimized out>,
fini=<value optimized out>, rtld_fini=<value optimized out>,
stack_end=0x7fff6eca2898) at libc-start.c:231
#12 0x0000000000405ec9 in _start ()
(gdb) l
1648 {
1649 register struct jobsave *jp = bck.list, *jpold=0;
1650 register int r= -1;
1651 while(jp)
1652 {
1653 if(jp->pid==pid)
1654 break;
1655 if(pid==0 && !jp->next)
1656 break;
1657 jpold = jp;
(gdb) p jp
$1 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->next
$2 = (struct jobsave *) 0xad1f5a0
(gdb) p jpold
$3 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->pid
$4 = 10156
(gdb) p pid
$5 = 0
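In short, the problem is that the job save list, which should be linear,
becomes circular: jp->next == jp. Since pid is 0, jp->pid is 10156 and
jp->next is never NULL, neither break condition in the loop can be met.
(This looks similar to
https://mailman.research.att.com/pipermail/ast-users/2004q2/000535.html )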
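To illustrate why a self-referencing node makes this loop spin, here is a
minimal standalone sketch. It is not ksh code: "struct savenode",
"walk_saved" and "has_cycle" are hypothetical names, the struct is a
simplified stand-in for the saved-job node, and the "jp = jp->next" advance
step is an assumption because the gdb listing stops before it. The step
limit exists only so that the demonstration terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a saved-job node (not the ksh93 struct). */
struct savenode
{
	struct savenode *next;
	int pid;
};

/*
 * Walks the list with the same two break conditions as the loop shown at
 * jobs.c:1651-1657, plus a step limit. Returns the number of nodes visited,
 * or -1 if max_steps nodes were visited without leaving the loop.
 */
long walk_saved(struct savenode *list, int pid, long max_steps)
{
	struct savenode *jp = list;
	long steps = 0;
	assert(max_steps > 0);
	while(jp)
	{
		if(steps++ >= max_steps)
			return -1;	/* would spin forever without the limit */
		if(jp->pid==pid)
			break;
		if(pid==0 && !jp->next)
			break;
		jp = jp->next;	/* assumed advance step, not visible in the listing */
	}
	return steps;
}

/* Floyd's tortoise-and-hare check: returns 1 if the list loops back on itself, else 0. */
int has_cycle(const struct savenode *list)
{
	const struct savenode *slow = list, *fast = list;
	while(fast && fast->next)
	{
		slow = slow->next;
		fast = fast->next->next;
		if(slow == fast)
			return 1;
	}
	return 0;
}
```

With a well-formed two-node list and pid 0, walk_saved stops at the last
node. Once the head's next pointer refers to the head itself, as in the gdb
output above, walk_saved hits the step limit and has_cycle reports the loop.
A check along the lines of has_cycle might be usable as a debug assertion
while hunting for the place where the list gets corrupted.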
We were able to reproduce something similar by lowering the 'open files' and
'max user processes' limits via ulimit -n and ulimit -u and then executing:
original_script &
# and another script containing something like this:
I=0; while true; do cat </dev/random >out.$I; I=$((I+1)); done
After the resources were depleted, the script kept trying to fork new
processes and waiting. After this script was killed, ksh ended up in the
while loop in job_chksave with a circular list. This was reproduced with ksh
2006-02-14. With 2008-02-02 we were not able to reproduce it this way, but it
does occur on its own on a normal system after about 3 weeks.
("original_script" has about 3000 lines and runs another script of about
3000 lines in the background. Given that, and the roughly 3 weeks needed to
reproduce the problem, it is impossible for us to extract a small, fast
reproducer from it.)
I looked at the diff between 20080202:jobs.c and 20081104:jobs.c and there
seems to be only a very small change, so I assume no similar problem has been
fixed since 2008-02-02.
Do you have any idea whether this issue can be reproduced or tested easily?
Or has something similar been fixed recently?
Thanks for any help.
Regards,
Michal Hlavinka