http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5731





------- Additional Comments From [EMAIL PROTECTED]  2008-02-04 14:56 -------
got one!  I think...

http://buildbot.spamassassin.org:8010/t-feisty-561/builds/880/step-test/0
contains this:

t/spamd_kill_restart............# Failed test 41 in t/spamd_kill_restart.t at
line 43 fail #5
#  t/spamd_kill_restart.t line 43 is:   ok (!-e $pid_file);
spamd start failed: log: [15535] dbg: logger: adding facilities: all
[15535] dbg: logger: logging level is DBG
[15535] dbg: logger: successfully opened file
log/d.spamd_kill_restart/spamd.err.42.timestamped
[15535] dbg: logger: successfully added file method
[15535] dbg: spamd: will perform setuids? 0
[15535] dbg: spamd: creating INET socket:
[15535] dbg: spamd: Listen: 128
[15535] dbg: spamd: LocalAddr: 127.0.0.1
[15535] dbg: spamd: LocalPort: 48236
[15535] dbg: spamd: Proto: 6
[15535] dbg: spamd: ReuseAddr: 1
[15535] dbg: spamd: Type: 1
[15535] warn: server socket setup failed, retry 1: spamd: could not create INET
socket on 127.0.0.1:48236: Address already in use

http://taint.org/x/2008/bug5731.d.spamd_kill_restart.tgz contains the full
logs from the spamd_kill_restart.t test case, with strace logs.

Here's an examination of the key log, spamd.strace.33.

        First off, this is the last part of signal-handling
        before the error.  note that SIGINT is unblocked:

15514 1202157063.466831 rt_sigprocmask(SIG_UNBLOCK, [INT CHLD], NULL, 8) = 0
15514 1202157063.468826 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
15524 1202157063.493003 time( <unfinished ...>
15514 1202157063.493065 rt_sigaction(SIGALRM, NULL, {SIG_DFL}, 8) = 0
15514 1202157063.493222 rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_DFL}, 8) = 0
15514 1202157063.493367 rt_sigaction(SIGALRM, {0x8099360, [], SA_RESTART}, 
{SIG_DFL}
, 8) = 0
15514 1202157063.493749 alarm(64)       = 0

        various unimportant spamd stuff omitted here.

15514 1202157063.499815 read(7, "I\0\0<\244\n", 6) = 6
15514 1202157063.500038 time(NULL)      = 1202157063
15514 1202157063.500127 write(4, "Mon Feb  4 12:31:03 2008 [15514]"..., 77) = 77
15514 1202157063.500284 write(2, "[15514] dbg: prefork: child 1552"..., 52) = 52
15514 1202157063.503575 time(NULL)      = 1202157063
15514 1202157064.696936 write(4, "Mon Feb  4 12:31:03 2008 [15514]"..., 74) = 74
15514 1202157064.697259 write(2, "[15514] dbg: prefork: new lowest"..., 49) = 49
15514 1202157064.697604 time(NULL)      = 1202157064
15514 1202157064.697699 write(4, "Mon Feb  4 12:31:04 2008 [15514]"..., 66) = 66
15514 1202157064.697891 write(2, "[15514] dbg: prefork: child repo"..., 41) = 41

        at this point, the test script kills the
        spamd using SIGINT:

15514 1202157064.698111 --- SIGINT (Interrupt) @ 0 (0) ---

        we should see perl calling into its $SIG{INT} handler.
        instead we see this:

15514 1202157064.698255 mmap2(NULL, 2097152, PROT_NONE, 
MAP_PRIVATE|MAP_ANONYMOUS|MA
P_NORESERVE, -1, 0) = 0x408e8000
15514 1202157064.698376 munmap(0x408e8000, 98304) = 0
15514 1202157064.698466 munmap(0x40a00000, 950272) = 0
15514 1202157064.698596 mprotect(0x40900000, 135168, PROT_READ|PROT_WRITE) = 0
15514 1202157064.698743 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...>

        this futex() call sleeps.  Nothing happens until eventually
        a child's select() call hits its timeout:

15524 1202157363.464875 <... select resumed> ) = 0 (Timeout)

        the kids exit due to their timeouts:

15523 1202157364.652244 exit_group(0)   = ?

        and the parent proc is interrupted for the SIGCHLD:

15514 1202157364.675992 <... futex resumed> ) = -1 EINTR (Interrupted system 
call)
15514 1202157364.676062 --- SIGCHLD (Child exited) @ 0 (0) ---

        but then goes back into the futex():

15514 1202157364.676664 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...>

        and never exits.


Comparing the time when the SIGINT arrives to other test cases that worked ok
(e.g. spamd.strace.1), I see no difference; the SIGINT handler is unblocked,
and it should be able to handle the signal.

This looks like a perl signal-handling bug.

for reference, here's what should have happened at the SIGINT:

15473 1202157028.628858 --- SIGINT (Interrupt) @ 0 (0) ---
15473 1202157028.629083 time(NULL)      = 1202157028
15473 1202157028.629191 write(4, "Mon Feb  4 12:30:28 2008 [15473]"..., 85) = 85
15473 1202157028.629362 write(2, "[15473] info: spamd: server kill"..., 60) = 60
15473 1202157028.629710 close(5)        = 0
15473 1202157028.629789 munmap(0x402ff000, 4096) = 0
15473 1202157028.629875 close(5)        = -1 EBADF (Bad file descriptor)
15473 1202157028.629947 munmap(0x402fe000, 4096) = 0
15473 1202157028.630033
unlink("/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/t/log/spamd.pid") 
= 0

etc.
futex() isn't called in the file *at all*.  It ONLY appears in
the failing strace....

At this point I think I should switch to testing a later perl
build; fiddling with perl 5.6.1 signal-handling is a bit of a waste
of time, and I noted that the Ubuntu issue we're trying to track down
should occur with 5.8.x too.  Ideally it should be repro'able with
the gutsy package of perl.

Also worth trying: tracing with ltrace instead of strace, to get
a better idea of wtf that futex() call is up to.





------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

Reply via email to