Hi Jens, > after some weeks of running a stable service with this patch, I'd like > to propose merging this (we'd really like to switch back to an unpatched > Cyrus release in future).
This would be good. Can you attach it as a patch file please? CC me and I'll get it merged. Thanks for your hard work figuring this out. :) Cheers, ellie On Mon, Apr 11, 2016, at 05:52 PM, Jens Erat via Cyrus-devel wrote: > Hi all, > > after some weeks of running a stable service with this patch, I'd like > to propose merging this (we'd really like to switch back to an unpatched > Cyrus release in future). This is not only an issue to us, but we know > some other universities running similar Solaris setups and waiting to > upgrade to Cyrus 2.5 because of our issues. > > This is only an issue with Cyrus IMAP 2.5 and probably 3.0; Cyrus 2.4 > also does not properly repeat select if it failed, but it's not an issue > here as interrupts can occur anywhere. > > We didn't have those master loop hick ups any more. In fact, the while > loop even seems superfluous; just running myselect again if `r==0` would > be fine, but this might be different with other combinations of > operating system kernels, C libraries and compilers, and the loop seems > the safer thing to do. The limit of 100 retries seems a little bit > arbitrary, but I can't really argument for any number, something around > five should probably also be fine to cover all scenarios that are not > completely broken, anyway. > > We dumped the interrupted and again counters each time the master loop > ran (and several other metrics), and never had more than one EINTR or > EAGAIN in a row (so the next `myselect` immediately returned the FDSET). > > Kind regards from Lake Constance, Germany, > Jens > > Am 10.02.2016 um 13:39 schrieb Jens Erat via Cyrus-devel: > > Hi Bron, > > > > digging deeper, we realized there are still issues with the loop you > > proposed. `select` will only decrement tvptr on Linux (and maybe _some_ > > other systems, definitely not all of them), pselect will never do -- you > > might end up waiting forever here if nothing happens. > > > > On the other hand, we partially got a grasp of understanding what was > > going on. Because when `pselect` is available, signalling was disabled > > but during `pselect`, thus with some weird combination of load on the > > different services, each `pselect` returned with `EINTR` and the message > > handling code was never reached > > > > We ended up putting the whole deadline calculation in a queue, and have > > this code live at the moment: > > > > int interrupted = 0; > > int again = 0; > > do { > > /* how long to wait? - do now so that any scheduled wakeup > > * calls get accounted for*/ > > gettimeofday(&now, 0); > > tvptr = NULL; > > if (schedule && !in_shutdown) { > > double delay = timesub(&now, &schedule->mark); > > if (!interrupted && delay > 0.0) { > > timeval_set_double(&tv, delay); > > } > > else { > > tv.tv_sec = 0; > > tv.tv_usec = 0; > > } > > tvptr = &tv; > > } > > > > #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP) > > if (tvptr == NULL) blockp = 1; > > snmp_select_info(&maxfd, &rfds, tvptr, &blockp); > > #endif > > errno = 0; > > r = myselect(maxfd, &rfds, NULL, NULL, tvptr); > > > > if (r == -1) switch(errno) { > > /* Try again to get valid rfds, this time without blocking so we > > * will definitely process messages without getting interrupted > > * again. */ > > case EINTR: > > interrupted++; > > if (interrupted > 100) { > > syslog(LOG_WARNING, "Repeatedly interrupted, too > > many signals?"); > > /* Fake a timeout */ > > r = 0; > > FD_ZERO(&rfds); > > } > > break; > > /* Try again. */ > > case EAGAIN: again++; break; > > /* uh oh */ > > default: fatalf(1, "select failed: %m"); > > } > > } while (r == -1); > > > > We also dump interrupted and again together with some other variables at > > the end of each loop. The server is running fine for two days now, and > > the dumped information looks fine (we never experienced any delays >4 > > seconds since patching the server, we often found the master loop not > > completely running for minutes). I will report again if we didn't suffer > > from any interruptions for the next one or two weeks. > > > > Some remarks to the code: > > > > - We need to update the current timestamp `now` for decrementing the > > passed time. > > - `snmp_select_info` or's its FDs to rfds; thus it may be run repeatedly > > to update `tvptr` and not run over any SNMP events. > > - We are compiling without SNMP, thus this code is untested. > > - If interrupted, we try again to receive FDs, but without waiting > > (tvptr set to zero). Because of this, no large interruptions should > > occur -- but we also won't skip over message handling when flooded with > > signals. > > - After 100 retries after being interrupted, we empty rfds and continue > > to work, but print a warning. This is probably subject of discussion and > > _much_ too large: we're currenlty dumping the number of interruptions, > > and never had more than one at all before catching file descriptors (or > > none). > > - `int again` can and should be removed for live code, does not seem > > relevant at all (but for our own debug purpose). > > > > Kind regards from Lake Constance, Germany, > > Jens > > > > > > Am 06.02.2016 um 13:00 schrieb Bron Gondwana: > >> I could buy repeating the whole thing for EINTR but not for EAGAIN. > >> Obviously everything is fine under non-heavy load no matter what. > >> > >> I'm pretty sure your patch is wrong, because it will process the fdset > >> that the select was supposed to fill even if the select returned an error. > >> > >> You may have convinced me that my patch is wrong as well, and that we have > >> to always check for the in_shutdown and sigquit magic in that tight loop > >> in case it was a signal that needs to be responded to. > >> > >> Bron. > >> > >> On Sat, Feb 6, 2016, at 20:52, Jens Erat wrote: > >>> Hi Bron, > >>> > >>> thank you for the fast reply. > >>> > >>>> One thing I would suggest is running something like nginx in front of > >>> Cyrus, even if you only need it for authentication checks. > >>> > >>> We'll discuss this internally. > >>> > >>>> I believe this is the correct patch. If the select was interrupted > >>> then it needs to be re-run, but there is no need to repeat all the other > >>> work, so: > >>>> > >>>> - r = myselect(maxfd, &rfds, NULL, NULL, tvptr); > >>>> - if (r == -1 && errno == EAGAIN) continue; > >>>> - if (r == -1 && errno == EINTR) continue; > >>>> + do { > >>>> + r = myselect(maxfd, &rfds, NULL, NULL, tvptr); > >>>> + } while (r == -1 && (errno == EAGAIN || errno == EINTR)); > > > > -- > Jens Erat > Universität Konstanz > Kommunikations-, Infomations-, Medienzentrum (KIM) > Abteilung Basisdienste > D-78457 Konstanz > Mail: jens.e...@uni-konstanz.de > > Email had 1 attachment: > + smime.p7s > 7k (application/pkcs7-signature)