Hi again,

> I'll read through the thread again and see if it becomes clear, and
> won't push the patch to the 2.5 branch just yet in case this changes it.
I've read back through the thread, and it looks like our goals here are to:

* acknowledge that, with pselect() available, SIGCHLD, SIGQUIT, SIGINT and
  SIGTERM signals will -only- arrive during a call to myselect(), because
  they are blocked at all other times -- so we must expect a server with
  lots of short-lived processes to frequently interrupt out of myselect()
* handle signals as soon as possible after they arrive
* allow child_janitor() an opportunity to run at least sometimes during
  interrupt-heavy periods
* minimise useless work

In reading about how select, pselect, signals and EAGAIN interact, I've
noticed something: we currently treat the return value from myselect() as
either success or failure. We handle a few classes of failure but treat all
successes as identical. But the return value is actually a count of ready
descriptors (with negative -- specifically, -1 -- indicating failure). We
can use this to apply the post-select processing only when it's actually
relevant.

This simplifies a few things:

* no more jump-back-to-top-on-signal, so we never skip the end of the loop
  entirely
* we only need to loop through the services if we know there are going to
  be ready file descriptors for at least some of them
* child_janitor() gets to run even during interrupt storms
* SNMP handling can be selective (based on examples from SNMP docs, but
  untested)
* no need for an inner loop around myselect(), so no contortions to make
  that safe, and signals don't get stuck

An alternative patch implementing this approach is here:
https://github.com/elliefm/cyrus-imapd/tree/v25/alt-master-eintr

It seems fine to run child_janitor() at high frequency -- it already
contains calculations to limit its own processing to one round per second,
based on time elapsed since the last run.

Thoughts?

ellie

On Wed, Apr 27, 2016, at 12:20 PM, ellie timoney via Cyrus-devel wrote:
> Hi Jens,
>
> > your argumentation seems legit.
>
> Thanks :)
>
> > I'm not sure, though, whether the master
> > should _ever_ wait forever (NULL-pointer). Maybe this should never
> > happen, but I'd propose _not_ waiting would be safer than waiting
> > forever in some weird cases. And it should never happen except during
> > startup and shutdown. Not waiting might result in some CPU cycles
> > wasted, but the general code might prove more robust to other minor
> > issues.
>
> I can see a plausible argument here both ways: on the one hand, if we
> never use a NULL timeout, then we'll never hang like we did during
> shutdown in this particular case. On the other hand, waking up master
> periodically when it has no work to do will potentially degrade
> performance for the actual services that are trying to run.
>
> Also, hanging at shutdown in the way we did helped us identify the
> problem really quickly and fix it before it made it into a release.
>
> I'm not really sure how frequently master is likely to wake up during
> "normal" operation, but I expect it varies a lot by environment and
> configuration. It's hard to estimate where a reasonable compromise
> between "don't block indefinitely" and "don't block at all" might lie.
>
> I could see the addition of some kind of hang-notification wakeup,
> whereby if master has been asleep for longer than some configurable
> duration, it wakes up and logs a warning. The natural spot to do this,
> though, would be as another scheduled event, which wouldn't do anything
> in the particular case of "no schedule" that I think you're interested
> in. So to be useful at all it would require bigger changes to the tvptr
> calculation.
>
> I've had a look through the history of master.c, and the timeout has
> been NULL since the master process was conceived. Only the addition of
> the event scheduler introduced the optional timeout. Given it's been
> working well enough like this basically forever, I don't want to change
> it on our stable branch.
> I would consider a feature like the above for
> inclusion in 3.0, though, if anyone was interested enough to contribute
> a patch?
>
> > You have the better understanding of the overall master loop logic and
> > what to do best
>
> You flatter me, but I'm figuring this out as I go too :)
>
> > I'm fine with either solution (as long as it works).
>
> I'll push what we've got (your patch plus my shutdown fix) to the 2.5
> branch for inclusion in the next release, and also forward port it to
> the master branch...
>
> Though: I've just noticed another possible issue. We might get stuck in
> that inner loop if select keeps raising EAGAIN (it has no counter, and
> the "break" statement breaks out of the switch, not the inner loop).
>
> My system's documentation for select claims that "some other UNIX
> systems" can raise EAGAIN (despite POSIX not specifying this error for
> select), and suggests portable programs should retry as if for EINTR in
> this case.
>
> I guess EAGAIN didn't occur in your profiling, but maybe your system
> can't raise EAGAIN from select either?
>
> The old (prior to patch) behaviour for EAGAIN was to resume the outer
> loop from the top, same as EINTR.
>
> Right now I'm not sure if we should try to keep the old behaviour (retry
> from top on EAGAIN), try to keep it the same as EINTR (so give it a
> counter and a limit), or do something new (like just pretend it was a
> timeout, like we now do after too many interrupts, so that the end of
> loop processing can occur). Any thoughts?
>
> I'll read through the thread again and see if it becomes clear, and
> won't push the patch to the 2.5 branch just yet in case this changes it.
>
> Cheers,
>
> ellie
>
> On Tue, Apr 26, 2016, at 08:08 PM, Jens Erat wrote:
> > Hi Ellie,
> >
> > your argumentation seems legit. I'm not sure, though, whether the master
> > should _ever_ wait forever (NULL-pointer).
> > Maybe this should never
> > happen, but I'd propose _not_ waiting would be safer than waiting
> > forever in some weird cases. And it should never happen except during
> > startup and shutdown. Not waiting might result in some CPU cycles
> > wasted, but the general code might prove more robust to other minor
> > issues.
> >
> > You have the better understanding of the overall master loop logic and
> > what to do best, I'm fine with either solution (as long as it works).
> >
> > Regards,
> > Jens
> >
> > On 26.04.2016 at 07:18, ellie timoney wrote:
> > > Hi Jens,
> > >
> > >>> I'm having a look at the way tvptr is set. If the process is
> > >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > >>> indefinitely and never time out). I'm not sure if this is going to turn
> > >>> out to be significant here. I'm also not sure why we do this
> > >>> differently for this case.
> > >>>
> > >>> [...]
> > >>>
> > >>> What I think is happening is that, once shutdown is initiated, the only
> > >>> way myselect() can return is if one of the rfds becomes active, or if a
> > >>> signal arrives. It will no longer time out. But, once shutdown is
> > >>> initiated, at a certain point rfds are no longer becoming active
> > >>> (because all the children have themselves shut down), and signals are no
> > >>> longer occurring, so myselect() ends up blocked forever.
> > >>
> > >> I don't think we should block at all in this case. The master loop
> > >> should keep running and clean up. This is definitely a bug in my patch.
> > >
> > > I've read through the old/new code some more, and I don't think this is
> > > a bug in your patch.
> > >
> > > The logic prior to your patch was that myselect would get a NULL timeout
> > > by default (and thus block indefinitely waiting for a signal or ready
> > > file descriptors). This makes sense: we want to block until there is
> > > something to do.
> > >
> > > If there were a schedule (and we were not shutting down), then we would
> > > inspect it and set up a timeout, so that we would also unblock when it
> > > was time to process that. The timeout might be 0,0 if the next event
> > > were already overdue, so that it could be processed immediately.
> > >
> > > If we were shutting down, we'd no longer be processing scheduled events,
> > > so a scheduled event being due was not cause for a wakeup in this case.
> > >
> > > Your patch preserves this behaviour, correctly, I think.
> > >
> > > Where it causes problems is that it would now potentially call (and
> > > block in) myselect multiple times in a row without doing any other
> > > processing in between. Which seems fine and good for the case we're
> > > trying to fix, but turns out to be bad during shutdown.
> > >
> > > During shutdown, we expect to get a bunch of SIGCHLDs for the exiting
> > > children, which are reaped in the outer loop, before the myselect() call
> > > (and after reaping, if there are no more children, we exit). But with
> > > pselect, SIGCHLD is blocked until pselect temporarily unblocks it, so
> > > these signals will always arrive during myselect(). This means that
> > > during shutdown, myselect() will legitimately return -1/EINTR a number
> > > of times, as the child processes exit, and we need to reap them and
> > > decrement counters in order to exit cleanly.
> > >
> > > With your patch, we no longer reap/decrement/exit on a signal unless
> > > we've also coincidentally exceeded 5 consecutive EINTRs, and have thus
> > > broken out of the myselect() loop -- which doesn't necessarily happen.
> > >
> > > Understanding the cause, I'm even more sure that my fix for the shutdown
> > > problem is correct: when we're in_shutdown, we /want/ any single EINTR
> > > to send us back around the outer loop so we can reap children and maybe
> > > exit.
> > >
> > >>> Maybe the answer is to break out of the myselect loop after one
> > >>> iteration if we're in_shutdown, so that we don't block indefinitely
> > >>> waiting for iterations that will never occur. I think just changing the
> > >>> loop condition from
> > >>>     while (r == -1)
> > >>> to
> > >>>     while (!in_shutdown && r == -1)
> > >>> might do the trick.
> > >>
> > >> I'd instead set a non-null, but zero timeout, so select will still clean
> > >> up the signal mask and query for FDs. I pushed an alternative fixup on
> > >> GitHub:
> > >>
> > >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> > >
> > > This is a much bigger change. It inverts the logic ("wait indefinitely
> > > for something to do, unless the schedule says to wake earlier" => "don't
> > > wait, unless the schedule says how long to wait"). It also depends on
> > > there always being an upcoming scheduled event, otherwise the process
> > > will spin on the Services[] array and burn CPU for no reason.
> > >
> > >> Looking at the code, I'm wondering whether we need the test for
> > >> scheduled events at all. As I understand, the child janitor is always
> > >> registered as an event running every 10 seconds, so there should be no
> > >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> > >
> > > The test for scheduled events is the difference between "block until a
> > > signal occurs or an fd becomes ready" and "block until a signal occurs,
> > > an fd becomes ready, or it is time to process a scheduled event". If
> > > there are no scheduled events, we don't need to wake up in time for one,
> > > so we can just block until other work becomes available.
> > >
> > > The only time we want to not block is if we know there's a scheduled
> > > event needing immediate processing (due now, or overdue).
> > >
> > >> If I'm not mistaken, this actually also might have been an issue before
> > >> -- but don't ask me why it wasn't triggered.
> > >> Probably because of good
> > >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> > >> to now.
> > >
> > > This wasn't a problem before, because the exiting children would wake up
> > > myselect(), be processed and cleaned up at the top of the next iteration
> > > of the outer loop, and we would only call myselect() again if there were
> > > still children that we were waiting for (who would wake us up again when
> > > they finished).
> > >
> > > It's a problem with your patch because when we get a SIGCHLD we
> > > immediately myselect() again without cleaning up or checking whether
> > > there are even any more children to wait for. (It can be moved along by
> > > the operator sending a few more signals, to manually exceed the 5 limit,
> > > but a clean shutdown shouldn't require this.)
> > >
> > > I think the original timeout behaviour is correct. I think the patch for
> > > dealing with the repeated interrupts looks good, except for the shutdown
> > > problem; and I think the right way to deal with the shutdown problem is
> > > by explicitly testing for that, and not looping myselect() in that case.
> > >
> > > Cheers,
> > >
> > > ellie
> > >
> > > On Fri, Apr 15, 2016, at 07:42 PM, Jens Erat wrote:
> > >> Hi Ellie,
> > >>
> > >>> It is failing our Cassandane tests, because the test suite is unable to
> > >>> cleanly shut down the master process after each test finishes.
> > >>>
> > >>> Are you able to cleanly shut down the master process with a SIGTERM in
> > >>> your environment? Can you cleanly shut one down that has very little
> > >>> activity on it?
> > >>
> > >> Indeed. We also had the idled-not-terminating issue and still applied a
> > >> kill -9 after some time, so did not observe the issue yet.
> > >>
> > >>> I'm having a look at the way tvptr is set. If the process is
> > >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > >>> indefinitely and never time out).
> > >>> I'm not sure if this is going to turn
> > >>> out to be significant here. I'm also not sure why we do this
> > >>> differently for this case.
> > >>>
> > >>> [...]
> > >>>
> > >>> What I think is happening is that, once shutdown is initiated, the only
> > >>> way myselect() can return is if one of the rfds becomes active, or if a
> > >>> signal arrives. It will no longer time out. But, once shutdown is
> > >>> initiated, at a certain point rfds are no longer becoming active
> > >>> (because all the children have themselves shut down), and signals are no
> > >>> longer occurring, so myselect() ends up blocked forever.
> > >>
> > >> I don't think we should block at all in this case. The master loop
> > >> should keep running and clean up. This is definitely a bug in my patch.
> > >>
> > >>> Maybe the answer is to break out of the myselect loop after one
> > >>> iteration if we're in_shutdown, so that we don't block indefinitely
> > >>> waiting for iterations that will never occur. I think just changing the
> > >>> loop condition from
> > >>>     while (r == -1)
> > >>> to
> > >>>     while (!in_shutdown && r == -1)
> > >>> might do the trick.
> > >>
> > >> I'd instead set a non-null, but zero timeout, so select will still clean
> > >> up the signal mask and query for FDs. I pushed an alternative fixup on
> > >> GitHub:
> > >>
> > >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> > >>
> > >> Looking at the code, I'm wondering whether we need the test for
> > >> scheduled events at all. As I understand, the child janitor is always
> > >> registered as an event running every 10 seconds, so there should be no
> > >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> > >>
> > >> Because the timeout is not only NULLed, but also set to zero as default,
> > >> we will not block any more iff
> > >>
> > >> - we're in shutdown, or
> > >> - no schedule is set up (which should not happen anyway, but we have a
> > >>   safe fallback if the schedule is not completely set up/already torn
> > >>   down on shutdown)
> > >>
> > >> If I'm not mistaken, this actually also might have been an issue before
> > >> -- but don't ask me why it wasn't triggered. Probably because of good
> > >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> > >> to now.
> > >>
> > >>> This behaviour also suggests that a reasonable limit for interruptions
> > >>> might be one that scales with the number of child processes (rather than
> > >>> the arbitrary 5), but I don't think this alone would be enough to avoid
> > >>> the problem I just encountered. If the remaining children all exit at
> > >>> the same time, and get processed by a single myselect(), then there's
> > >>> still n-1 loop iterations before shutdown would occur -- same problem.
> > >>
> > >> I'm not sure about this. This is the maximum number of SIGCHLDs, but we
> > >> also might fetch a SIGTERM along the way. Anyway: it is _very_ unlikely
> > >> that we get just another signal in between the select calls. At least on
> > >> our setup, we never observed it at all, and this is a rather high-volume
> > >> service! Usually, the signals will arrive (but be blocked) somewhere
> > >> else in the master loop, and multiple signals are cleaned up with a
> > >> single (first) select statement.
> > >>
> > >> Anyway, nothing bad happens if we _sometimes_ pass over the message
> > >> handling, and at least we're now logging a warning after some retries.
> > >> If somebody has severe issues with this, they'll become aware of the
> > >> reasons by tweaking their configuration (for example by setting a more
> > >> reasonable -T parameter) or digging deeper.
> > >>
> > >> Regards,
> > >> Jens
> > >>
> > >> --
> > >> Jens Erat
> > >> Universität Konstanz
> > >> Kommunikations-, Informations-, Medienzentrum (KIM)
> > >> Abteilung Basisdienste
> > >> D-78457 Konstanz
> > >> Mail: jens.e...@uni-konstanz.de