Hi Ellie, > It is failing our Cassandane tests, because the test suite is unable to > cleanly shutdown the master process after each test finishes. > > Are you able to cleanly shut down the master process with a SIGTERM in > your environment? Can you cleanly shut one down that has very little > activity on it?
Indeed. We also had the idled-not-terminating-issue and still applied a kill -9 after some time, so did not observe the issue yet. > I'm having a look at the way tvptr is set. If the process is > in_shutdown, then tvptr remains NULL (which tells myselect() to block > indefinitely and never time out). I'm not sure if this is going to turn > out to be significant here. I'm also not sure why we do this > differently for this case. > > [...] > > What I think is happening is that, once shutdown is initiated, the only > way myselect() can return is if one of the rfds becomes active, or if a > signal arrives. It will no longer time out. But, once shutdown is > initiated, at a certain point rfds are no longer becoming active > (because all the children have themselves shut down), and signals are no > longer occurring, so myselect() ends up blocked forever. I don't think we should block at all in this case. The master loop should keep running and clean up. This is definitely a bug in my patch. > Maybe the answer is to break out of the myselect loop after one > iteration if we're in_shutdown, so that we don't block indefinitely > waiting for iterations that will never occur. I think just changing the > loop condition from > while (r == -1) > to > while (!in_shutdown && r == -1) > might do the trick. I'd instead set a non-null, but zero timeout, so select will still clean up the signal mask and query for FDs. I pushed an alternative fixup on GitHub: https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c Looking at the code, I'm wondering whether we need the test for scheduled events at all. As I understand, the child janitor is always registered as an event running all 10 seconds, so there should be no branch for schedule == NULL. Anyway, testing probably won't hurt, either. Because the timeout is not only NULLed, but also set to zero as default, we will not block any more iff - we're in shutdown or - no schedule is set up (which should not happen, anyway, but we have a safe fallback if the schedule is not completely set up/already torn down on shutdown) If I'm not mistaken, this actually also might have been an issue before -- but don't ask me why it wasn't triggered. Probably because of good luck and some interrupts (eg. the shutdown signal) being sufficient up to now. > This behaviour also suggests that a reasonable limit for interruptions > might be one that scales with the number of child processes (rather than > the arbitrary 5), but I don't think this alone would be enough to avoid > the problem I just encountered. If the remaining children all exit at > the same time, and get processed by a single myselect(), then there's > still n-1 loop iterations before shutdown would occur -- same problem. I'm not sure about this. This is the maximum number of SIGCHLDs, but we also might fetch a SIGTERM on the go. Anyway: It is _very_ unlikely that we get just another signal in-between the select calls. At least on our setup, we never observed it at all, and this is a rather high-volume service! Usually, the signals will arrive (but be blocked) somewhere else in the master loop, and multiple signals are cleaned up with a single (first) select statement. Anyway, nothing bad happens if we _sometimes_ pass over the message handling, and at least we're now logging a warning after some retries. If somebody has severe issues with this, he'll get aware of the reasons by tweaking his configuration (for example by setting a more reasonable -T parameter) or digging deeper. Regards, Jens -- Jens Erat Universität Konstanz Kommunikations-, Infomations-, Medienzentrum (KIM) Abteilung Basisdienste D-78457 Konstanz Mail: jens.e...@uni-konstanz.de
smime.p7s
Description: S/MIME Cryptographic Signature