Dear Cyrus maintainers, the very short version: we probably found an issue that can stall the master process on high-volume IMAP servers. A possible patch is attached, but it needs some more discussion.
- - -

At the University of Konstanz, we're running a rather large IMAP server for about 18k users on some large Oracle/Solaris machines. We don't have a murder/cluster. The machine is overprovisioned to last for some years and runs blazingly fast during normal operation. We're running Solaris 11 and ZFS; the mailboxes are stored on network storage attached through Fibre Channel (and we don't observe any I/O issues). The dual-socket machine is equipped with 256GB of memory. We tried compiling with both GCC 4.8 and Sun CC (from Solaris Studio).

From time to time, however, the server completely stalled -- usually at times with a high fork rate. We observed the issue in the morning when lots of people started reading their mail, and we experienced it during a DDoS attack on our network which made the firewall drop lots of connections (with the mail clients instantly trying to reconnect). As a result, the mail server was still running fine for everybody still holding a connection, but denied service to pretty much all newly connecting clients. We had to restart the whole Cyrus IMAPd service to recover.

We started getting a clue about what was going wrong when we added debug output and some ctable verifications, as we suspected something was wrong with how the active/ready workers are counted. We also took time checkpoints during the master loop to see _where_ it got stuck. It seemed to be stuck while jumping back from the very last statement of the loop (taking a timestamp) to the very first (also taking a timestamp). The code looked roughly like this:

    struct timeval profiling[11];

    gettimeofday(&profiling[10], 0);        // LOOP DONE (initial value)
    for (;;) {
        gettimeofday(&profiling[0], 0);     // INIT

        // Start scheduled tasks
        gettimeofday(&profiling[1], 0);     // SCHEDULE

        // Other master loop tasks: spawn, message handling, ...
        gettimeofday(&profiling[9], 0);     // SNMP_ALARMS

        syslog(LOG_INFO, "MASTERLOOP_PROFILING: %f,%f,%f,%f,%f,%f,%f,%f,%f,%f",
               timesub(&profiling[10], &profiling[0]),  // the "jump back" from the previous iteration's end
               timesub(&profiling[0], &profiling[1]),
               timesub(&profiling[1], &profiling[2]),
               timesub(&profiling[2], &profiling[3]),
               timesub(&profiling[3], &profiling[4]),
               timesub(&profiling[4], &profiling[5]),
               timesub(&profiling[5], &profiling[6]),
               timesub(&profiling[6], &profiling[7]),
               timesub(&profiling[7], &profiling[8]),
               timesub(&profiling[8], &profiling[9]));

        gettimeofday(&profiling[10], 0);    // LOOP DONE
    }

The results really puzzled us: why would jumping back from the end of a loop to its beginning take multiple minutes? By adding another log line at the _beginning_ of the loop, we finally realized that the loop was indeed running very often -- it simply never completed. `profiling[10]` never changed while the server was stuck.

Knowing this, the problem was obvious: while `select`ing on the sockets and waiting for messages, the server got interrupted all the time. Each interruption makes the whole loop start over from scratch, skipping message handling and with it the accounting of available worker daemons, so too few of them get spawned.

    r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
    if (r == -1 && errno == EAGAIN) continue;
    if (r == -1 && errno == EINTR) continue;
    if (r == -1) {
        /* uh oh */
        fatalf(1, "select failed: %m");
    }

This is a common pattern when you want to `sleep` or `select`: try the call, and if you get interrupted or told to try again, `continue` and retry. Usually, however, this is enclosed in a tiny loop of its own -- see the sketch below.
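For illustration only -- the following is not Cyrus code, just a minimal sketch of that generic pattern with a hypothetical `select_retry()` wrapper: only the system call itself is repeated on interruption, and the work around it is not.

    #include <errno.h>
    #include <sys/select.h>

    /*
     * Hypothetical helper, not taken from master.c: retry only the
     * select() call on EINTR/EAGAIN, so an interruption never restarts
     * the caller's much larger loop.  A production version would
     * re-initialize the fd sets (and possibly the timeout) before
     * retrying, since select() may leave them in an unspecified state
     * after returning an error.
     */
    static int select_retry(int maxfd, fd_set *rfds, struct timeval *tvptr)
    {
        int r;

        do {
            r = select(maxfd, rfds, NULL, NULL, tvptr);
        } while (r == -1 && (errno == EINTR || errno == EAGAIN));

        return r;   /* -1 here means a real error, not a mere interruption */
    }

The patch we propose below takes a different route: instead of retrying the call, it simply stops treating EINTR/EAGAIN as a reason to restart the master loop and lets the rest of the iteration (message handling, reaping children) run.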
On a test machine, we were not able to fully reproduce the issue, but we could very well observe the message-handling part of the loop not running for a dozen or more iterations of the master loop. A small patch that continues with the current iteration instead of starting over from scratch completely resolved this (against `master/master.c` in the Cyrus IMAP 2.5.7 release tarball):

    2481,2483c2481
    <         if (r == -1 && errno == EAGAIN) continue;
    <         if (r == -1 && errno == EINTR) continue;
    <         if (r == -1) {
    ---
    >         if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {

This ran fine on our test setup, but we're still wary of patching one of the most central pieces of logic in a very important service. What is your opinion?

- Message handling was sometimes stuck for minutes. We can probably agree this should never happen on a high-volume server.
- Might this be related to the issues we observed?
- Are there any consequences for the subsequent code (message handling and reaping children) if we do _not_ start the loop from scratch here?

If we get a "this looks fine, and you won't horribly mess things up" from the developers, we will give it a try and patch the production machine as proposed.

Kind regards from Lake Constance, Germany,
Jens Erat

--
Jens Erat
Universität Konstanz
Kommunikations-, Informations-, Medienzentrum (KIM)
Abteilung Basisdienste
D-78457 Konstanz
Mail: jens.e...@uni-konstanz.de