----- Original Message -----
> From: "Jan Stancek" <[email protected]>
> To: [email protected]
> Cc: "Jeffrey Burke" <[email protected]>
> Sent: Friday, 30 November, 2012 3:37:03 PM
> Subject: [LTP] clone03/06 randomly crashing
>
> Hi,
>
> I'm occasionally getting core files from clone03/clone06 testcases.
> The testcase itself gives PASS, it is the child which is randomly
> crashing.
> It seems to occur more on single cpu systems.
>
> For example:
> Core was generated by `clone03'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
> ttype=2,
> tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
> at tst_res.c:412
> 412 {
> (gdb) bt
> #0 0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
> ttype=2,
> tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
> at tst_res.c:412
> #1 0x00000000004031be in tst_res (ttype=2, fname=<value optimized
> out>, arg_fmt=<value optimized out>) at tst_res.c:316
> #2 0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0
> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
> #3 0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 <cleanup>,
> arg_fmt=<value optimized out>) at tst_res.c:698
> #4 0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
> #5 <signal handler called>
> #6 0x00000037940db650 in __write_nocancel () at
> ../sysdeps/unix/syscall-template.S:82
> #7 0x000000000040169e in child_fn () at clone03.c:208
> #8 0x00000037940e890d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>
> Dump of assembler code for function tst_print:
>    0x0000000000402bd0 <+0>: mov %rbx,-0x30(%rsp)
>    0x0000000000402bd5 <+5>: mov %rbp,-0x28(%rsp)
>    0x0000000000402bda <+10>: mov %edx,%ebx
>    0x0000000000402bdc <+12>: mov %r12,-0x20(%rsp)
>    0x0000000000402be1 <+17>: mov %r13,-0x18(%rsp)
>    0x0000000000402be6 <+22>: mov %rdi,%r12
>    0x0000000000402be9 <+25>: mov %r14,-0x10(%rsp)
>    0x0000000000402bee <+30>: mov %r15,-0x8(%rsp)
>    0x0000000000402bf3 <+35>: sub $0x2858,%rsp
>    0x0000000000402bfa <+42>: mov %esi,%r14d
> => 0x0000000000402bfd <+45>: mov %rcx,0x18(%rsp)
>
> (gdb) p $rsp
> $1 = (void *) 0x14c3800
> (gdb) x/1x $rsp
> 0x14c3800: Cannot access memory at address 0x14c3800
>
> It looks like it receives SIGTERM and while handling SIGTERM it hits
> SIGSEGV.
> I don't know what is source of that SIGTERM. I was looking into the
> second part
> and looks like the stack for child is not large enough.
>
> I modified clone03.c (see attached clone03_poison.patch) to get some
> extra
> empty buffer before the child's stack, which was set to pattern 0xDE.
>
> Before:
>                       |-------------------------------|
>                       child_stack                     child_stack+CHILD_STACK_SIZE
> After:
> |---------------------|-------------------------------|
> poision_start         child_stack                     child_stack+CHILD_STACK_SIZE
>
> Now if I start clone03 and kill it I can randomly reproduce the
> SIGSEGV (attached clone03_kill.sh).
> The backtrace usually looks like:
> ... (random place)
> #5 0x000000000040324e in tst_res (ttype=2, fname=<value optimized
> out>, arg_fmt=<value optimized out>) at tst_res.c:316
> #6 0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420
> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
> #7 0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 <cleanup>,
> arg_fmt=<value optimized out>) at tst_res.c:698
> #8 0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
> #9 <signal handler called>
> #10 0x0000003327cdb650 in __write_nocancel () at
> ../sysdeps/unix/syscall-template.S:82
> #11 0x000000000040172e in child_fn () at clone03.c:212
> #12 0x0000003327ce890d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>
> (gdb) p poison_start
> $1 = (void *) 0xa02010
> (gdb) p child_stack
> $2 = (void *) 0xa03010
>
> (gdb) x/16x poison_start
> 0xa02010: 0xdededede 0xdededede 0xdededede 0xdededede
> 0xa02020: 0xdededede 0xdededede 0xdededede 0xdededede
> 0xa02030: 0xdededede 0xdededede 0xdededede 0xdededede
> 0xa02040: 0xdededede 0xdededede 0xdededede 0xdededede
> ...
> (gdb)
> 0xa02490: 0xdededede 0xdededede 0xdededede 0xdededede
> 0xa024a0: 0x00000018 0x00000030 0x00a02800 0x00000000
> 0xa024b0: 0x00a02740 0x00000000 0xdededede 0xdededede
> 0xa024c0: 0xdededede 0xdededede 0x27409296 0x00000033
>
> The above shows that 0xDE pattern has been overwritten.
>
> Extending child stack helps with the second part: SIGSEGV
> #define CHILD_STACK_SIZE 16384*4
> but I have no idea, where is that first SIGTERM coming from. Any
> ideas?
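
For anyone who wants to try the poisoning approach described above without the attached patch, here is a minimal sketch of the idea. It is not clone03_poison.patch itself: the names run_with_poison, count_overwritten and trivial_child are hypothetical, and CLONE_VM is an assumption of this sketch so that the parent can inspect the guard region directly (the analysis above looked at the child's core file in gdb instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (16384 * 4)	/* enlarged size from the message above */
#define POISON_SIZE 4096		/* hypothetical guard size */
#define POISON_BYTE 0xDE

/* Count the bytes in buf that no longer hold the poison pattern. */
static size_t count_overwritten(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i] != POISON_BYTE)
			n++;
	return n;
}

/* Hypothetical child that uses almost no stack. */
static int trivial_child(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Lay out [poison | child stack] in one allocation, run fn on the stack
 * and report how many poison bytes were overwritten (-1 on error).
 * The stack grows down on x86_64, so an overflow runs into the poison.
 */
static long run_with_poison(int (*fn)(void *), size_t stack_size)
{
	unsigned char *poison_start, *child_stack;
	long damaged;
	pid_t pid;

	assert(stack_size % 16 == 0);	/* keep the stack top aligned */

	poison_start = malloc(POISON_SIZE + stack_size);
	if (poison_start == NULL)
		return -1;
	memset(poison_start, POISON_BYTE, POISON_SIZE);
	child_stack = poison_start + POISON_SIZE;

	/* CLONE_VM (assumption): the child writes into this same memory. */
	pid = clone(fn, child_stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0 || waitpid(pid, NULL, 0) != pid) {
		free(poison_start);
		return -1;
	}

	damaged = (long)count_overwritten(poison_start, POISON_SIZE);
	free(poison_start);
	return damaged;
}
```

Calling run_with_poison(trivial_child, CHILD_STACK_SIZE) should report no damage. A child that ends up in a signal handler with a large stack frame, like tst_print in the backtrace above, is the case where the count would be expected to become non-zero when the stack is too small.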
It appears to be ltp-pan, which sees the child as an orphan. When I added "-d 511",
I got some additional output:

<<<execution_status>>>
initiation_status="ok"
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=0
<<<test_end>>>
pids still running:
orphans still running:
 -26125
clone03 1 TBROK : unexpected signal 15 received (pid = 26126).
clone03 2 TBROK : Remaining cases broken
pan was signaled with sig 2... propagating sig 2 to orphaned pgrp -26125
orphans still running:

I'll send a patch that adds wait() to the parent, so the child is reaped
before the testcase exits and pan no longer finds it as an orphan.

Regards,
Jan

_______________________________________________
Ltp-list mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ltp-list
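
P.S. A rough sketch of what "parent waits for the child" could look like, for illustration only. This is not the patch itself: clone_and_reap and child_fn here are hypothetical stand-ins, and the real testcase goes through the LTP helpers rather than calling clone() directly.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (16384 * 4)

/* Hypothetical child: exits with the code passed through arg. */
static int child_fn(void *arg)
{
	return *(int *)arg;
}

/*
 * clone() a child and block until it has exited, so that no child is
 * left running after the parent reports its result.
 * Returns the child's exit status, or -1 on error.
 */
static int clone_and_reap(int exit_code)
{
	char *child_stack;
	int status;
	pid_t pid;

	child_stack = malloc(CHILD_STACK_SIZE);
	if (child_stack == NULL)
		return -1;

	/* The stack grows down on x86_64, so pass the top of the buffer. */
	pid = clone(child_fn, child_stack + CHILD_STACK_SIZE, SIGCHLD,
		    &exit_code);
	if (pid < 0) {
		free(child_stack);
		return -1;
	}

	/* Reap the child before returning to the caller. */
	if (waitpid(pid, &status, 0) != pid) {
		free(child_stack);
		return -1;
	}

	free(child_stack);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the wait in place the parent cannot exit while the child is still inside write(), which is the window in which pan was delivering the signal in the output above.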
