[ast-developers] script hangs taking 100% CPU time

Michal Hlavinka Fri, 14 Nov 2008 04:14:00 -0800

Hi,

We are facing an issue with ksh: a script hangs, taking 100% CPU time.


This issue cannot be reproduced at will. Usually it takes about 3 weeks for
the problem to occur. It is more likely when the system is under heavy load.

The backtrace shows this:

Core was generated by `/bin/ksh'.
#0  0x0000000000414359 in job_chksave (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
1653 if(jp->pid==pid)
(gdb) bt
#0  0x0000000000414359 in job_chksave (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1653
#1  0x000000000041451c in job_unpost (pwtop=<value optimized out>,
   notify=<value optimized out>)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1523
#2  0x0000000000415a97 in job_wait (pid=0)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/jobs.c:1392
#3  0x0000000000434eef in sh_exec (t=0xacadc00, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1087
#4  0x00000000004368ea in sh_exec (t=0xacadd20, flags=<value optimized out>)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1218
#5  0x0000000000435f2c in sh_exec (t=0xad1f5a0, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1313
#6  0x0000000000435dd9 in sh_exec (t=0xad1f5a0, flags=181532064)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1596
#7  0x0000000000435db3 in sh_exec (t=0xad19060, flags=4)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1332
#8  0x00000000004350ec in sh_exec (t=0xacad410, flags=36)
   at /usr/src/debug/ksh-20080202/src/cmd/ksh93/sh/xec.c:1531
#9  0x00000000004076b3 in exfile ()
#10 0x0000000000406bbc in sh_main ()
#11 0x000000363281d8a4 in __libc_start_main (main=0x405f80 <main>, argc=2,
   ubp_av=0x7fff6eca28a8, init=<value optimized out>,
   fini=<value optimized out>, rtld_fini=<value optimized out>,
   stack_end=0x7fff6eca2898) at libc-start.c:231
#12 0x0000000000405ec9 in _start ()
(gdb) l
1648 {
1649 register struct jobsave *jp = bck.list, *jpold=0;
1650 register int r= -1;
1651 while(jp)
1652 {
1653 if(jp->pid==pid)
1654 break;
1655 if(pid==0 && !jp->next)
1656 break;
1657 jpold = jp;
(gdb) p jp
$1 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->next
$2 = (struct jobsave *) 0xad1f5a0
(gdb) p jpold
$3 = (struct jobsave *) 0xad1f5a0
(gdb) p jp->pid
$4 = 10156
(gdb) p pid
$5 = 0


In short, the problem is that the job save list, which should be linear,
becomes circular: jp->next == jp
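
To illustrate why this state never terminates, here is a minimal C sketch.
It is not the ksh code: struct node, walk_like_chksave() and has_cycle() are
hypothetical names, and the "jp = jp->next" step is an assumption, since the
gdb listing above stops at line 1657. The walk uses the same two exit tests
as lines 1653-1656, plus a step limit so the demonstration itself cannot
hang. The second function is an ordinary Floyd (tortoise and hare) cycle
check of the kind that could be used as a debugging assertion.

```c
#include <stddef.h>

/* Hypothetical stand-in for ksh's struct jobsave; only the two fields
 * visible in the gdb session are modelled. */
struct node {
    struct node *next;
    int pid;
};

/* Walk the list with the same exit conditions as the loop listed at
 * jobs.c:1651-1656. Returns the number of nodes visited, or -1 if
 * `limit` steps were taken without reaching either exit. */
static long walk_like_chksave(struct node *list, int pid, long limit)
{
    long steps = 0;
    struct node *jp = list;

    while (jp) {
        if (steps++ >= limit)
            return -1;              /* would spin forever in the real loop */
        if (jp->pid == pid)
            break;                  /* found the requested pid */
        if (pid == 0 && !jp->next)
            break;                  /* pid 0: stop at the last element */
        jp = jp->next;              /* assumed advance step, not in the listing */
    }
    return steps;
}

/* Floyd's cycle detection: returns 1 if the list loops back on itself,
 * 0 if it ends in NULL. */
static int has_cycle(const struct node *list)
{
    const struct node *slow = list, *fast = list;

    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

With a node whose next pointer refers to itself and whose pid is nonzero (as
in the core dump: jp->pid = 10156, pid = 0), neither exit test can ever be
true, so walk_like_chksave() only returns because of the step limit, and
has_cycle() reports the loop.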

(Looks similar to 
https://mailman.research.att.com/pipermail/ast-users/2004q2/000535.html )

We were able to reproduce something similar by lowering the 'open files'
and 'max user processes' limits via ulimit -n and ulimit -u and executing:
original_script &
# and another script containing something like this:
I=0; while true; do cat </dev/random >out.$I; I=$((I+1)); done

After the resources were depleted, the script kept trying to fork new
processes and waiting. After this script was killed, ksh ended up in the
while loop in job_chksave with a circular list. This was reproduced with ksh
2006-02-14. With 2008-02-02 we weren't able to reproduce it this way, but it
still occurs on its own on a normal system after about 3 weeks.

("original_script" has about 3000 lines and executes another script of about
3000 lines in the background. With about 3 weeks needed to reproduce this, it
is impossible for us to extract a small and fast reproducer from it.)

I looked at the diff between 20080202:jobs.c and 20081104:jobs.c and there
seems to be only a very small change, so I assume no similar problem has been
fixed since 2008-02-02.

Do you have any idea whether it's possible to reproduce or test this issue
easily? Or has something similar been fixed recently?

Thanks for any help.

Regards,

Michal Hlavinka
_______________________________________________
ast-developers mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-developers
