Hi Alan,
On Wed, Mar 18, 2015 at 01:11:32PM +0000, Alan Fitton wrote:
> Basically the signal_queue isn't being updated with a reference to SIGTTOU,
> because signal_state[SIGTTOU].count is > 0. I guess there's an assumption in
> the code that if any given signal already has events counted up in
> signal_state, then it must have updated signal_queue so they will get
> processed soon.
This is indeed what the code does:

    if (!signal_state[sig].count) {
        /* signal was not queued yet */
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
        else
            qfprintf(stderr, "Signal %d : signal queue is unexpectedly full.\n", sig);
    }
    signal_state[sig].count++;

So there's theoretically no way to have a non-zero count value
with a zero signal_queue_len, unless one of these gets corrupted
at some point.
Also, __signal_process_queue() seems to properly count these:

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        sig  = signal_queue[cur_pos];
        desc = &signal_state[sig];
        if (desc->count) {
            struct sig_handler *sh, *shb;
            list_for_each_entry_safe(sh, shb, &desc->handlers, list) {
                if ((sh->flags & SIG_F_TYPE_FCT) && sh->handler)
                    ((void (*)(struct sig_handler *))sh->handler)(sh);
                else if ((sh->flags & SIG_F_TYPE_TASK) && sh->handler)
                    task_wakeup(sh->handler, sh->arg | TASK_WOKEN_SIGNAL);
            }
            desc->count = 0;
        }
    }
    signal_queue_len = 0;

> But from what I see below, this doesn't seem to be the case
> always, and then all events of a particular signal can end up getting "lost".
> I think there is some timing or logic issue here.
>
> (22 = SIGTTOU)
>
> /* Break on SIGTTOU. There are 805 events in the
> Program received signal SIGTTOU, Stopped (tty output).
> 0x00002b369ab6a373 in __epoll_wait_nocancel () from /lib64/libc.so.6
> (gdb) print signal_state[22]
> $16 = {count = 805, handlers = {n = 0xe1efa80, p = 0xe1efa80}}
> (gdb) print signal_queue_len
> $17 = 0
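That clearly demonstrates a bug! Well, thinking about it now, there is
one possibility: if the signal is delivered while we're in
__signal_process_queue(), what you observe could indeed happen. The
handler would increment the count and queue the signal, but the loop
would already be past that point and would miss desc->count, and we'd
then clear signal_queue_len afterwards. That leaves a non-zero count
with an empty queue, so later deliveries of the same signal are never
queued again.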
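To make that window concrete, here is a minimal single-threaded sketch that replays the suspected interleaving by hand, without any real signal delivery. It is only an illustration: fake_signal_handler(), process_queue_with_late_signal(), signal_is_stuck() and the MAX_SIGNAL value are hypothetical stand-ins, and the handler list is elided.

```c
#include <assert.h>

#define MAX_SIGNAL 256  /* assumed size, for illustration only */

static int signal_queue_len;
static int signal_queue[MAX_SIGNAL];
static struct { unsigned int count; } signal_state[MAX_SIGNAL];

/* Same queuing logic as the handler quoted earlier, callable by hand. */
static void fake_signal_handler(int sig)
{
    assert(sig > 0 && sig < MAX_SIGNAL);
    if (!signal_state[sig].count) {
        /* signal was not queued yet */
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
    }
    signal_state[sig].count++;
}

/* Processing loop with the handler calls elided. late_sig (hypothetical)
 * simulates a signal landing after the loop has finished but before the
 * queue length is reset; pass 0 for "no late signal".
 */
static void process_queue_with_late_signal(int late_sig)
{
    int cur_pos, sig;

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        sig = signal_queue[cur_pos];
        if (signal_state[sig].count) {
            /* the registered handlers would run here */
            signal_state[sig].count = 0;
        }
    }

    if (late_sig)
        fake_signal_handler(late_sig);  /* queued, count becomes 1 */

    signal_queue_len = 0;               /* ...and the queue entry is dropped */
}

/* Returns 1 when events are counted for sig but nothing is queued at all. */
static int signal_is_stuck(int sig)
{
    return signal_state[sig].count > 0 && signal_queue_len == 0;
}
```

Queuing any signal, then calling process_queue_with_late_signal(22) leaves signal 22 with a non-zero count and an empty queue. From then on every further fake_signal_handler(22) only bumps the counter, which would match the count = 805 / signal_queue_len = 0 state in your gdb output.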
Could you please try to instrument this function to confirm whether the
issue is there? If so, we need to use a different set of variables to
process this and protect the loop.
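One possible way to protect the loop, sketched here under assumptions and not necessarily what will end up in the code, is to block signal delivery with sigprocmask() while the queue is walked and reset. Anything arriving in that window stays pending and is queued cleanly once the mask is restored. All names below (queue_signal(), watch_signal(), process_queue_protected(), demo_callback()) are hypothetical and the handler list is reduced to a single callback per signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#define MAX_SIGNAL 256  /* assumed size, for illustration only */

typedef void (*sig_callback)(int);

static volatile sig_atomic_t signal_queue_len;
static int signal_queue[MAX_SIGNAL];
static struct {
    volatile sig_atomic_t count;
    sig_callback handler;   /* one callback instead of a handler list */
} signal_state[MAX_SIGNAL];

/* Real signal handler: same queuing logic as the code quoted earlier. */
static void queue_signal(int sig)
{
    if (!signal_state[sig].count) {
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
    }
    signal_state[sig].count++;
}

/* Register cb for sig and install queue_signal() as its handler. */
static void watch_signal(int sig, sig_callback cb)
{
    struct sigaction sa;

    assert(sig > 0 && sig < MAX_SIGNAL);
    signal_state[sig].handler = cb;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = queue_signal;
    sigfillset(&sa.sa_mask);    /* handlers never interrupt each other */
    sigaction(sig, &sa, NULL);
}

/* Walk and reset the queue with every signal blocked, so that no handler
 * can modify count or signal_queue_len between the loop and the reset.
 */
static void process_queue_protected(void)
{
    sigset_t all, old;
    int cur_pos, sig;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        sig = signal_queue[cur_pos];
        if (signal_state[sig].count) {
            if (signal_state[sig].handler)
                signal_state[sig].handler(sig);
            signal_state[sig].count = 0;
        }
    }
    signal_queue_len = 0;

    /* pending signals are delivered here and queued consistently */
    sigprocmask(SIG_SETMASK, &old, NULL);
}

/* Demo callback: re-raises its signal once, in the middle of processing. */
static int demo_calls;
static void demo_callback(int sig)
{
    if (demo_calls++ == 0)
        raise(sig);
}
```

With watch_signal(SIGUSR1, demo_callback) followed by raise(SIGUSR1), the first process_queue_protected() call runs the callback, and the signal re-raised from inside it is only delivered after the mask is restored, so it shows up as a fresh queue entry for the next pass instead of a count without a queue slot. The trade-off is two extra system calls per pass, which should only matter if the queue is processed very often.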
I'll try to do something about it. I've already got a report of a reload
not working once in a while, but I had no information about it so I
attributed it to a PEBKAC-style issue. If you could share a reproducer, it
would really help. Given your sig count, I guess you send signals in loops?
Thanks,
Willy