Hi Bron, digging deeper, we realized there are still issues with the loop you proposed. `select` will only decrement tvptr on Linux (and maybe _some_ other systems, definitely not all of them), pselect will never do -- you might end up waiting forever here if nothing happens.
On the other hand, we partially got a grasp of understanding what was going on. Because when `pselect` is available, signalling was disabled but during `pselect`, thus with some weird combination of load on the different services, each `pselect` returned with `EINTR` and the message handling code was never reached We ended up putting the whole deadline calculation in a queue, and have this code live at the moment: int interrupted = 0; int again = 0; do { /* how long to wait? - do now so that any scheduled wakeup * calls get accounted for*/ gettimeofday(&now, 0); tvptr = NULL; if (schedule && !in_shutdown) { double delay = timesub(&now, &schedule->mark); if (!interrupted && delay > 0.0) { timeval_set_double(&tv, delay); } else { tv.tv_sec = 0; tv.tv_usec = 0; } tvptr = &tv; } #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP) if (tvptr == NULL) blockp = 1; snmp_select_info(&maxfd, &rfds, tvptr, &blockp); #endif errno = 0; r = myselect(maxfd, &rfds, NULL, NULL, tvptr); if (r == -1) switch(errno) { /* Try again to get valid rfds, this time without blocking so we * will definitely process messages without getting interrupted * again. */ case EINTR: interrupted++; if (interrupted > 100) { syslog(LOG_WARNING, "Repeatedly interrupted, too many signals?"); /* Fake a timeout */ r = 0; FD_ZERO(&rfds); } break; /* Try again. */ case EAGAIN: again++; break; /* uh oh */ default: fatalf(1, "select failed: %m"); } } while (r == -1); We also dump interrupted and again together with some other variables at the end of each loop. The server is running fine for two days now, and the dumped information looks fine (we never experienced any delays >4 seconds since patching the server, we often found the master loop not completely running for minutes). I will report again if we didn't suffer from any interruptions for the next one or two weeks. Some remarks to the code: - We need to update the current timestamp `now` for decrementing the passed time. - `snmp_select_info` or's its FDs to rfds; thus it may be run repeatedly to update `tvptr` and not run over any SNMP events. - We are compiling without SNMP, thus this code is untested. - If interrupted, we try again to receive FDs, but without waiting (tvptr set to zero). Because of this, no large interruptions should occur -- but we also won't skip over message handling when flooded with signals. - After 100 retries after being interrupted, we empty rfds and continue to work, but print a warning. This is probably subject of discussion and _much_ too large: we're currenlty dumping the number of interruptions, and never had more than one at all before catching file descriptors (or none). - `int again` can and should be removed for live code, does not seem relevant at all (but for our own debug purpose). Kind regards from Lake Constance, Germany, Jens Am 06.02.2016 um 13:00 schrieb Bron Gondwana: > I could buy repeating the whole thing for EINTR but not for EAGAIN. > Obviously everything is fine under non-heavy load no matter what. > > I'm pretty sure your patch is wrong, because it will process the fdset that > the select was supposed to fill even if the select returned an error. > > You may have convinced me that my patch is wrong as well, and that we have to > always check for the in_shutdown and sigquit magic in that tight loop in case > it was a signal that needs to be responded to. > > Bron. > > On Sat, Feb 6, 2016, at 20:52, Jens Erat wrote: >> Hi Bron, >> >> thank you for the fast reply. >> >>> One thing I would suggest is running something like nginx in front of >> Cyrus, even if you only need it for authentication checks. >> >> We'll discuss this internally. >> >>> I believe this is the correct patch. If the select was interrupted >> then it needs to be re-run, but there is no need to repeat all the other >> work, so: >>> >>> - r = myselect(maxfd, &rfds, NULL, NULL, tvptr); >>> - if (r == -1 && errno == EAGAIN) continue; >>> - if (r == -1 && errno == EINTR) continue; >>> + do { >>> + r = myselect(maxfd, &rfds, NULL, NULL, tvptr); >>> + } while (r == -1 && (errno == EAGAIN || errno == EINTR)); -- Jens Erat Universität Konstanz Kommunikations-, Infomations-, Medienzentrum (KIM) Abteilung Basisdienste D-78457 Konstanz Mail: jens.e...@uni-konstanz.de
smime.p7s
Description: S/MIME Cryptographic Signature