> ## snmp_timeout might be run when it shouldn't > > First of all, I have no experience at all with SNMP and what I observed > is simply wrong or irrelevant. But reading the docs while putting > together our `select` patch, I noted that `snmp_timeout` should only be > run if `select` timed out (from `man 3 snmp_timeout`): > > > If select times out, snmp_timeout should be called to see if the > timeout was intended for SNMP. > > The master loop calls `snmp_timeout()` unconditionally, though: > > r = myselect(maxfd, &rfds, NULL, NULL, tvptr); > if (r == -1 && errno == EAGAIN) continue; > if (r == -1 && errno == EINTR) continue; > if (r == -1) { > /* uh oh */ > fatalf(1, "select failed: %m"); > } > > #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP) > /* check for SNMP queries */ > snmp_read(&rfds); > snmp_timeout(); > #endif > > I'd propose to check for an empty `rfds`-set or zero return value before > running `snmp_timeout` (not sure which one would be more appropriate).
I also don't know thing one about SNMP, and don't seem to have the devel package installed, so I'm basing this solely on man pages found by Google... Background: myselect() is a thin wrapper around select()/pselect(), each of which return the number of ready file descriptors (which is zero if it timed out: none are ready), or -1 if there's an error. We fatal earlier if it returned -1, so by the time we call snmp_read() and snmp_timeout(), we know that r is either zero (timeout) or a count of ready descriptors. As far as I can tell, snmp_read() is only useful if there are ready descriptors, but it's not a problem if there aren't (it just won't find anything to do). And, as far as I can tell, snmp_timeout() is only useful after a timeout has occurred, but it's also not a problem if one hasn't (it just probably won't find anything to do). If my understanding is correct, by calling them unconditionally we're possibly wasting time/energy, but not really hurting anything. We could replace these two lines with something like: if (r) snmp_read(&rfds); else snmp_timeout(); ... and maybe that would be faster. Or, maybe snmp_read() and snmp_timeout()s checks are already pretty well optimised, and making these calls conditional will throw off branch prediction and make things slower, I dunno On Fri, Apr 1, 2016, at 07:02 PM, Jens Erat via Cyrus-devel wrote: > Dear Cyrus developers, > > when searching for the issue with our mail setup crashing all few days, > I've been reading the Cyrus master code for quite some time and found > two possible issues. By the way, I'll send out the promised feedback on > our proposed patch within the next week or two; we're now running the > patch without modified `-T` and everything seems fine so far. > > > ## centry_add hides older entries > > This seems like a minor issue as it might result in broken state in rare > cases, which repairs itself after "hiding" process is terminated again > (which might take days or even weeks, though). > > `centry_add` does simply add a new entry to the ctable, but does not > verify whether it already exists: > > /* add a centry to the global table of all > * centries, using the given pid as the key */ > static void centry_add(struct centry *c, pid_t p) > { > c->pid = p; > c->next = ctable[p % child_table_size]; > ctable[p % child_table_size] = c; > } > > While debugging, we added a `centry_find` call before adding and indeed > observed sporadic events when entries have been added, but not cleaned > up yet. Mostly they occurred when Cyrus was stuck because of the > `select` interrupting the master loop, but very rarely this also > happened during normal operation. I must admit we had a situation with > heavily increased probability for for such issues: > > - Solaris limited to 30k PIDs by default, while we're running 10-12k > processes during normal operation > - `-T 60` joined with a high prefork resulted in imapd processes > terminated frequently > - Single active, rather large machine serving a large number of users, > lots of them using a webmail system with very short IMAP sessions > - Because of the `select` bug, we often didn't have the child janitor > cleaning up the ctable for seconds and minutes > > Generally, this might happen on every setup when having bad luck. > > To fix this issue and completely prevent such hiding of older processes, > I'd propose to run `centry_find` before adding a new entry, and run the > child janitor if needed. > > > ## snmp_timeout might be run when it shouldn't > > First of all, I have no experience at all with SNMP and what I observed > is simply wrong or irrelevant. But reading the docs while putting > together our `select` patch, I noted that `snmp_timeout` should only be > run if `select` timed out (from `man 3 snmp_timeout`): > > > If select times out, snmp_timeout should be called to see if the > timeout was intended for SNMP. > > The master loop calls `snmp_timeout()` unconditionally, though: > > r = myselect(maxfd, &rfds, NULL, NULL, tvptr); > if (r == -1 && errno == EAGAIN) continue; > if (r == -1 && errno == EINTR) continue; > if (r == -1) { > /* uh oh */ > fatalf(1, "select failed: %m"); > } > > #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP) > /* check for SNMP queries */ > snmp_read(&rfds); > snmp_timeout(); > #endif > > I'd propose to check for an empty `rfds`-set or zero return value before > running `snmp_timeout` (not sure which one would be more appropriate). > > Regards, > Jens > > -- > Jens Erat > Universität Konstanz > Kommunikations-, Infomations-, Medienzentrum (KIM) > Abteilung Basisdienste > D-78457 Konstanz > Mail: jens.e...@uni-konstanz.de > > Email had 1 attachment: > + smime.p7s > 7k (application/pkcs7-signature)