On 10/13/2011 11:27 PM, Voelker, Bernhard wrote:
> Pádraig Brady wrote:
> 
>> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>>> reopen 9737
>>> thanks
>>>
>>> Pádraig Brady wrote:
>>>
>>>> Bah, this is just a racy test I think.
>>>> Hopefully the attached fixes it.
>>>
>>> Thank you for the patch.
>>>
>>> I tried it 16 times:
>>>
>> * 14x PASS, execution time real < 0.4s
>>>
>>> * 1x test failure (in the 5th run)
>>
>> So the command exited without receiving SIGINT.
>> Or perhaps the touch of the 'received.int' file
>> is being done asynch. Anything special about your
>> file system?
> 
> It's a virtual host on a ESX server farm in our data center.
> 
> ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
> Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lvol0
>                        50G   15G   33G  31% /user
> 
> ecs@mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
> /dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)
> 
>>> * 1x the test lasted 20s (in the 16th run)
>>
>> But this one passed, which means the command
>> did receive the SIGINT, but then didn't exit?
> 
> Sounds like one error is shadowing another.
> 
>> I'm confused, sorry,
>> Pádraig.
> 
> That's strange, indeed.
> 
> I repeated the test with < 0.2 load 100 times:
> the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
> and the run #24 and #25 PASSed but took 20 seconds,
> all other PASSed within <=0.3s.

I reproduced this weirdness in OpenSuse 10.3 in a VM.
Much less frequently though.
Delays in 10 out of 2750
Signal handler call failure in 1 out of 2750

The delays might be due to bash, but I updated
to 4.2 and the issue still persists.
I suspect kernel issues too.

Anyway I've attached 2 patches to replace the previous one.
The first hopefully addresses any races in the test.
I don't think you hit any of these TBH.

The second should detect the signal issues and skip the test.

cheers,
Pádraig.
>From 9de7da8e61849b015e9a28f294fb606f06469182 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com>
Date: Thu, 13 Oct 2011 14:58:58 +0100
Subject: [PATCH] tests: fix a race in timeout-group

This could cause a false failure, or even
an infinite loop in rare circumstances.

* tests/misc/timeout-group: Increase the timeouts
passed to the timeout command, so that they're
effectively not used.  Instead the command termination
is triggered by the kill commands when everything
is in the correct state.

Reported by Bernhard Voelker.
---
 tests/misc/timeout-group |   21 ++++++++++++++++-----
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/tests/misc/timeout-group b/tests/misc/timeout-group
index fedd53a..5d31551 100755
--- a/tests/misc/timeout-group
+++ b/tests/misc/timeout-group
@@ -40,15 +40,25 @@ chmod a+x timeout.cmd
 
 cat > group.sh <<\EOF
 #!/bin/sh
-timeout --foreground 5 ./timeout.cmd 10&
+trap '' INT
+timeout --foreground 25 ./timeout.cmd 20&
 wait
 EOF
 chmod a+x group.sh
 
+check_timeout_cmd_running()
+{
+  local delay="$1"
+  test -e timeout.running ||
+    { sleep $delay; return 1; }
+}
+
+
 # Start above script in its own group.
 # We could use timeout for this, but that assumes an implementation.
 setsid ./group.sh &
-until test -e timeout.running; do sleep .1; done
+# Wait 6.3s for timeout.cmd to start
+retry_delay_ check_timeout_cmd_running .1 6 || fail=1
 # Simulate a Ctrl-C to the group to test timely exit
 # Note dash doesn't support signalling groups (a leading -)
 env kill -INT -- -$!
@@ -66,9 +76,10 @@ rm -f int.received timeout.running
 # Note the first timeout must send a signal that
 # the second is handling for it to be propagated to the command.
 # SIGINT, SIGTERM, SIGALRM etc. are implicit.
-timeout -sALRM 2 timeout -sINT 10 ./timeout.cmd 5&
-until test -e timeout.running; do sleep .1; done
-kill -ALRM $!
+timeout -sALRM 30 timeout -sINT 25 ./timeout.cmd 20&
+# Wait 6.3s for timeout.cmd to start
+retry_delay_ check_timeout_cmd_running .1 6 || fail=1
+kill -ALRM $! # trigger the alarm of the first timeout command
 wait
 test -e int.received || fail=1
 
-- 
1.7.6

>From 18321eebadd9c6118c9569c8f06be6fba0f5b008 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com>
Date: Thu, 3 Nov 2011 01:42:43 +0000
Subject: [PATCH] tests: avoid signal issues in timeout-group

These issues were seen on an OpenSuse 10.3 system
(kernel 2.6.22.5 x86_64, glibc 2.6.1-18, bash updated to 4.2),
and also on a 64 bit SLES system with a 2.6.16 kernel.
Both systems had 2 CPUs.

There were two issues seen.  1. Occasionally the
timeout.cmd shell script would block SIGINT until
the sleep command exited.  2. Much less frequently the
signal handler in the timeout command itself was ignored,
causing SIGALRM to kill the process.

* tests/misc/timeout-group: Detect the above two cases,
and skip rather than fail.  Note only issue 2. causes
a failure unless skipped, but we skip for case 1. also,
for diagnostic purposes.
---
 tests/misc/timeout-group |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/tests/misc/timeout-group b/tests/misc/timeout-group
index 5d31551..4c4bbaa 100755
--- a/tests/misc/timeout-group
+++ b/tests/misc/timeout-group
@@ -73,14 +73,25 @@ rm -f int.received timeout.running
 # commands that create their own group
 # This didn't work before 8.13.
 
+start=$(date +%s)
+
 # Note the first timeout must send a signal that
 # the second is handling for it to be propagated to the command.
 # SIGINT, SIGTERM, SIGALRM etc. are implicit.
 timeout -sALRM 30 timeout -sINT 25 ./timeout.cmd 20&
+pid=$!
 # Wait 6.3s for timeout.cmd to start
 retry_delay_ check_timeout_cmd_running .1 6 || fail=1
-kill -ALRM $! # trigger the alarm of the first timeout command
-wait
+kill -ALRM $pid # trigger the alarm of the first timeout command
+wait $pid
+ret=$?
+test $ret -eq 124 ||
+  skip_ "timeout returned $ret. SIGALRM not handled?"
 test -e int.received || fail=1
 
+end=$(date +%s)
+
+test $(expr $end - $start) -lt 20 ||
+  skip_ "timeout.cmd didn't receive a signal until after sleep?"
+
 Exit $fail
-- 
1.7.6

Reply via email to