Dear Cyrus developers, when searching for the issue with our mail setup crashing all few days, I've been reading the Cyrus master code for quite some time and found two possible issues. By the way, I'll send out the promised feedback on our proposed patch within the next week or two; we're now running the patch without modified `-T` and everything seems fine so far.
## centry_add hides older entries This seems like a minor issue as it might result in broken state in rare cases, which repairs itself after "hiding" process is terminated again (which might take days or even weeks, though). `centry_add` does simply add a new entry to the ctable, but does not verify whether it already exists: /* add a centry to the global table of all * centries, using the given pid as the key */ static void centry_add(struct centry *c, pid_t p) { c->pid = p; c->next = ctable[p % child_table_size]; ctable[p % child_table_size] = c; } While debugging, we added a `centry_find` call before adding and indeed observed sporadic events when entries have been added, but not cleaned up yet. Mostly they occurred when Cyrus was stuck because of the `select` interrupting the master loop, but very rarely this also happened during normal operation. I must admit we had a situation with heavily increased probability for for such issues: - Solaris limited to 30k PIDs by default, while we're running 10-12k processes during normal operation - `-T 60` joined with a high prefork resulted in imapd processes terminated frequently - Single active, rather large machine serving a large number of users, lots of them using a webmail system with very short IMAP sessions - Because of the `select` bug, we often didn't have the child janitor cleaning up the ctable for seconds and minutes Generally, this might happen on every setup when having bad luck. To fix this issue and completely prevent such hiding of older processes, I'd propose to run `centry_find` before adding a new entry, and run the child janitor if needed. ## snmp_timeout might be run when it shouldn't First of all, I have no experience at all with SNMP and what I observed is simply wrong or irrelevant. But reading the docs while putting together our `select` patch, I noted that `snmp_timeout` should only be run if `select` timed out (from `man 3 snmp_timeout`): > If select times out, snmp_timeout should be called to see if the timeout was intended for SNMP. The master loop calls `snmp_timeout()` unconditionally, though: r = myselect(maxfd, &rfds, NULL, NULL, tvptr); if (r == -1 && errno == EAGAIN) continue; if (r == -1 && errno == EINTR) continue; if (r == -1) { /* uh oh */ fatalf(1, "select failed: %m"); } #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP) /* check for SNMP queries */ snmp_read(&rfds); snmp_timeout(); #endif I'd propose to check for an empty `rfds`-set or zero return value before running `snmp_timeout` (not sure which one would be more appropriate). Regards, Jens -- Jens Erat Universität Konstanz Kommunikations-, Infomations-, Medienzentrum (KIM) Abteilung Basisdienste D-78457 Konstanz Mail: jens.e...@uni-konstanz.de
smime.p7s
Description: S/MIME Cryptographic Signature