On 10/13/2011 11:27 PM, Voelker, Bernhard wrote: > Pádraig Brady wrote: > >> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote: >>> reopen 9737 >>> thanks >>> >>> Pádraig Brady wrote: >>> >>>> Bah, this is just a racy test I think. >>>> Hopefully the attached fixes it. >>> >>> Thank you for the patch. >>> >>> I tried it 16 times: >>> >> * 14x PASS, execution time real < 0.4s >>> >>> * 1x test failure (in the 5th run) >> >> So the command exited without receiving SIGINT. >> Or perhaps the touch of the 'received.int' file >> is being done asynch. Anything special about your >> file system? > > It's a virtual host on a ESX server farm in our data center. > > ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a > Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 > x86_64 x86_64 x86_64 GNU/Linux > > ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> df -h . > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg01-lvol0 > 50G 15G 33G 31% /user > > ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user > /dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr) > >>> * 1x the test lasted 20s (in the 16th run) >> >> But this one passed, which means the command >> did receive the SIGINT, but then didn't exit? > > Sounds like one error is shadowing another. > >> I'm confused, sorry, >> Pádraig. > > That's strange, indeed. > > I repeated the test with < 0.2 load 100 times: > the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above, > and the run #24 and #25 PASSed but took 20 seconds, > all other PASSed within <=0.3s.
I reproduced this weirdness in OpenSuse 10.3 in a VM. Much less frequently though. Delays in 10 out of 2750 Signal handler call failure in 1 out of 2750 The delays might be due to bash, but I updated to 4.2 and the issue still persists. I suspect kernel issues too. Anyway I've attached 2 patches to replace the previous one. The first hopefully addresses any races in the test. I don't think you hit any of these TBH. The second should detect the signal issues and skip the test. cheers, Pádraig.
>From 9de7da8e61849b015e9a28f294fb606f06469182 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com> Date: Thu, 13 Oct 2011 14:58:58 +0100 Subject: [PATCH] tests: fix a race in timeout-group This could cause a false failure, or even an infinite loop in rare circumstances. * tests/misc/timeout-group: Increase the timeouts passed to the timeout command, so that they're effectively not used. Instead the command termination is triggered by the kill commands when everything is in the correct state. Reported by Bernhard Voelker. --- tests/misc/timeout-group | 21 ++++++++++++++++----- 1 files changed, 16 insertions(+), 5 deletions(-) diff --git a/tests/misc/timeout-group b/tests/misc/timeout-group index fedd53a..5d31551 100755 --- a/tests/misc/timeout-group +++ b/tests/misc/timeout-group @@ -40,15 +40,25 @@ chmod a+x timeout.cmd cat > group.sh <<\EOF #!/bin/sh -timeout --foreground 5 ./timeout.cmd 10& +trap '' INT +timeout --foreground 25 ./timeout.cmd 20& wait EOF chmod a+x group.sh +check_timeout_cmd_running() +{ + local delay="$1" + test -e timeout.running || + { sleep $delay; return 1; } +} + + # Start above script in its own group. # We could use timeout for this, but that assumes an implementation. setsid ./group.sh & -until test -e timeout.running; do sleep .1; done +# Wait 6.3s for timeout.cmd to start +retry_delay_ check_timeout_cmd_running .1 6 || fail=1 # Simulate a Ctrl-C to the group to test timely exit # Note dash doesn't support signalling groups (a leading -) env kill -INT -- -$! @@ -66,9 +76,10 @@ rm -f int.received timeout.running # Note the first timeout must send a signal that # the second is handling for it to be propagated to the command. # SIGINT, SIGTERM, SIGALRM etc. are implicit. -timeout -sALRM 2 timeout -sINT 10 ./timeout.cmd 5& -until test -e timeout.running; do sleep .1; done -kill -ALRM $! +timeout -sALRM 30 timeout -sINT 25 ./timeout.cmd 20& +# Wait 6.3s for timeout.cmd to start +retry_delay_ check_timeout_cmd_running .1 6 || fail=1 +kill -ALRM $! # trigger the alarm of the first timeout command wait test -e int.received || fail=1 -- 1.7.6
>From 18321eebadd9c6118c9569c8f06be6fba0f5b008 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com> Date: Thu, 3 Nov 2011 01:42:43 +0000 Subject: [PATCH] tests: avoid signal issues in timeout-group These issues were seen on an OpenSuse 10.3 system (kernel 2.6.22.5 x86_64, glibc 2.6.1-18, bash updated to 4.2), and also on a 64 bit SLES system with a 2.6.16 kernel. Both systems had 2 CPUs. There were two issues seen. 1. Occasionally the timeout.cmd shell script would block SIGINT until the sleep command exited. 2. Much less frequently the signal handler in the timeout command itself was ignored, causing SIGALRM to kill the process. * tests/misc/timeout-group: Detect the above two cases, and skip rather than fail. Note only issue 2. causes a failure unless skipped, but we skip for case 1. also, for diagnostic purposes. --- tests/misc/timeout-group | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/tests/misc/timeout-group b/tests/misc/timeout-group index 5d31551..4c4bbaa 100755 --- a/tests/misc/timeout-group +++ b/tests/misc/timeout-group @@ -73,14 +73,25 @@ rm -f int.received timeout.running # commands that create their own group # This didn't work before 8.13. +start=$(date +%s) + # Note the first timeout must send a signal that # the second is handling for it to be propagated to the command. # SIGINT, SIGTERM, SIGALRM etc. are implicit. timeout -sALRM 30 timeout -sINT 25 ./timeout.cmd 20& +pid=$! # Wait 6.3s for timeout.cmd to start retry_delay_ check_timeout_cmd_running .1 6 || fail=1 -kill -ALRM $! # trigger the alarm of the first timeout command -wait +kill -ALRM $pid # trigger the alarm of the first timeout command +wait $pid +ret=$? +test $ret -eq 124 || + skip_ "timeout returned $ret. SIGALRM not handled?" test -e int.received || fail=1 +end=$(date +%s) + +test $(expr $end - $start) -lt 20 || + skip_ "timeout.cmd didn't receive a signal until after sleep?" + Exit $fail -- 1.7.6