Re: MESSAGE quota resource implemention
Actually, really I'd like to create a new UNIQUEID - and store all the files in paths based on uniqueid rather than on folder name. This would not only mean renames could be fast and atomic, but that delayed delete would be fast. The downside is a more opaque filesystem layout. Oh, another upside - file path limitations don't exist so much any more. While UNIQUEID is nice, the opaqueness is annoying. Personally, I liked the idea that we talked about a while back which I think was: $spooldir/a/user/aardvark/user.aardvark/ $spooldir/a/user/aardvark/user.aardvark.drafts/ $spooldir/a/user/aardvark/user.aardvark.trash/ $spooldir/a/user/aardvark/user.aardvark.foo/ $spooldir/a/user/aardvark/user.aardvark.foo.bar/ $spooldir/a/user/aardvark/user.aardvark.abc.xyz/ So you end up with every folder for a user in one dir. This solves the current messy handling of sub-dirs (eg currently you have to create the intermediate dir /abc/ even though there's no entry in mailboxes.db for it), and makes renaming any folder very cheap still (because you can do it with a single dir rename, rather than having to move each message file), but you don't go completely opaque. Delete handling is still easier, just rename to DELETED.$oldname.$UNIQUEID or something like that, because it's cheap to rename anyway. Of course it means re-organising folder layout for every installation out there, but maybe we need to bump major versions anyway, cyrus 3.0 here we come :) Rob
Re: Zero byte appends
Some applications (e.g. Kolab) store other objects than Mail (e.g. Addressbooks, Calendar, Tasks or Notes) in IMAP folders. They might create empty objects for some reason. Well that's pretty evil. The objects should still have a header, right? Or do they just store blobs? http://www.faqs.org/rfcs/rfc2822.html So the body is optional (3.5), but in the headers (3.6), there has to be a Date and From header The only required header fields are the origination date field and the originator address field(s). All other header fields are syntactically optional. More information is contained in the table following this definition. So really a 0 byte message is not a valid message. Rob
Re: about the replication
I run quota on master and replica every once in a while and compare. That's an easy way to find and fix the problem cases, regardless of whether the master or replica was at fault. There are typically only about half a dozen cases each time, so I don't think that it's a huge problem. I agree, it's not a huge problem, and there are other ways to fix it. However coming back to my main point, shouldn't sync_client -u be guaranteed to make a replica be in sync with the master for a given user for everything; folders, subscriptions, flags, seen state, sieve script, quotas, etc. Currently it seems to do everything but quotas, which is annoyingly inconsistent! Rob
Re: about the replication
I ran my script last night on one of the back-ends. This one has 20928 mailboxes. And 1067 are not in sync. From those, 897 users didn't have all their mailboxes replicated to the replica. The others had the same folder count, but one or more messages were not the same (or weren't there). We also have a script which does this for all users every week. It checks for each user that the following match: 1. Folder list 2. Folder subscriptions 3. Quota (total + used) 4. Sieve scripts 5. For every folder, status output for (messages uidnext unseen recent uidvalidity) 6. For every message in every folder, flags + message sizes + GUID All of those are reasonably quick since they access only meta-data. We also check on a regular basis for a random sampling of messages that rfc822.sha1 on both sides match. The script also waits for a few seconds and retries if there's any problems (replication is non-synchronous, so there may be delays in differences between both sides). If there's a mismatch problem, the script can be set to: 1) Run a sync_client -u 2) Run a reconstruct on both sides 3) Completely delete the replica mailbox and run a sync_client -u to build a complete new copy from scratch One annoying thing we've found recently is that sync_client -u doesn't fix quota problems, it seems the protocol doesn't include an absolute set replica quota to this action. We'd been meaning to look into that but hadn't got to it. David: Don't suppose you could check that out? It would be nice if sync_client -u could really fix every user problem, it's very close to that right now, quota mismatches seems to be about the only thing it can't fix. FYI, unfortunately at the moment this script is fairly closely tied to our system (it uses a bunch of other modules to work out master/replica servers for a user). It probably would be nice to factor that stuff out so others could use it... Rob
Re: statuscache
- statuscache v.2 stores statuscache_data as a binary blob, which is platform dependent (byte-order word size). This make statuscache.db non-portable. I believe that this might be the first cyrusdb that would be non-portable. Does anybody care? The only tricky part is going to be upgrading from v.2 to v.3. If its not designed to be persistent and/or migrated between platforms, I'm not sure I care, since your design is definitely quicker. Basically our startup scripts completely delete the statuscache.db file on startup, so migrating isn't an issue for us at all. So if you want to make it platform independent for others in the future, I'm fine with that. BTW, is statuscache:log necessary, or is it just for debugging? It's not necessary, it was just helpful for debugging rather than having to recompile everything. From memory, clients make SO many status calls, that it can completely flood the log if you leave the syslog calls in. Rob
Re: statuscache
Why do you delete the cache? I doesn't appear from the code that the cache entries would be invalid if the mailbox and seen state aren't touched. Historical. We only use berkeley db for statuscache and duplicate delivery, and because of problems in the past with berkeley db's corrupting themselves, we found it easier to just delete them on startup. It avoids the case of cyrus failing to start if one of the bdb's is corrupted I'm changing the statuscache option to a switch and leaving the logging in as LOG_DEBUG My only concern with this is that it can cause a LOT of log messages. I've never been entirely sure how well syslog handles the case of lots of log messages. I think it still has to send them over the socket to the syslogd. Most people then have a syslog.conf that discards debug messages, so it always seemed like a possible way to overflow the syslog socket buffer and loose important messages (i'm pretty sure syslog never guarantees that a message will be logged, which is why djb implemented his own way that uses streams rather than dgrams). This is probably all entirely academic and not a real issue... Rob
Two bugs in 2.3.10-rc2 that cause segfaults
Hi Ken After some testing on our test server, we rolled out 2.3.10-rc2 onto one of our production IMAP stores and almost immediately starting noticing segfaults for some users. Bron and I spent quite a bit of time this afternoon debugging, and we're pretty sure we found both the causes. 1. In index_parse_sequence(), your realloc() doesn't multiply by sizeof(struct seq_range), so if you have a sequence list with more than about 8 items, it will start corrupting memory 2. The signed - unsigned changes in the prot stream stuff weren't as innocent as they looked and created a subtle and nasty bug. You might want to audit the other code where you changed signed - unsigned as well... Here's an example of the problem if you use IDLE and close the connection on it. In that case, what happens is: In cmd_idle(), we wait for data from the client by calling: c = getword(imapd_in, arg); getword() calls prot_getc(), which looks like this: #define prot_getc(s) ((s)-cnt-- 0 ? (int)*(s)-ptr++ : prot_fill(s)) Say there's no data in the prot buffer, then cnt == 0. However when we call the above macro, it does cnt--, which makes cnt wrap to 4294967295, and then calls prot_fill. prot_fill() calls read(), and if read() returns = 0 (eg other end closed the connection) it does. if (n = 0) { if (n) s-error = xstrdup(strerror(errno)); else s-eof = 1; return EOF; } So it sets the eof flag and returns EOF. Back in cmd_idle, we get the EOF, and exit out of cmd_idle and back into the main loop, which does. /* Parse tag */ c = getword(imapd_in, tag); Now getword calls the prot_getc() macro again... but uh, oh. cnt is definitely 0 now (it's 4294967295) and so we reading through unitialised memory trying to interpret it as IMAP commands until we hit a segfault somewhere. Our fix is to change prot_getc() to only decrement cnt if it's non-zero. Because it's an expression macro, it uses the C comma operator which I've found often even experienced C people don't know about (http://www.eskimo.com/~scs/cclass/int/sx4db.html). #define prot_getc(s) ((s)-cnt 0 ? (--(s)-cnt, (int)*(s)-ptr++) : prot_fill(s)) Both patches are here: http://cyrus.brong.fastmail.fm/patches/cyrus-fix-segfaults-2.3.10.diff Rob
Re: Making Replication Robust
This would seem to be a significant advantage of running sync_client outside master. When I shut down master, sync_client continues to process the outstanding log. I can then use sync_shutdown_file when it has finished and is idle. We do something similar. But it means you have to develop a bunch of your own infrastructure to make cyrus replication robust. It's not currently a start it and it just works until you shut it down solution, which means either people have to replicate the same extra infrastructure work everywhere separately, or people are going to get burnt not realising that what they're doing isn't safe. That's why I'd really like this to be in cyrus itself. I think we should be able to say in the documentation something like: Shuting down a cyrus master with a SIGQUIT ensures that all actions have been replicated to the replica side. It makes writing init scripts and the like a lot easier. There seems to be a spilt of opinion between BSD and SVR4: BSD tries to retry while SVR4 throws EINTR. Linux of course can work either way: http://www.gnu.org/software/libc/manual/html_node/Interrupted-Primitives.html Isn't a lot of writes already wrapped up in some retry_write() function. I admit I haven't looked closely. Anyway, is this really a problem. Basically shouldn't you be able to kill cyrus at any point, and files are left in a consistent restartable state? If so, if something returns EINTR, won't it just move on and eventually exit? Or is the problem that you have something like: write to file 1 write to file 2 And if the first returns EINTR but is ignored, and then it writes the complete data to the second, things are in an inconsistent state? Rob
Re: Making Replication Robust
Or is the problem that you have something like: write to file 1 write to file 2 And if the first returns EINTR but is ignored, and then it writes the complete data to the second, things are in an inconsistent state? This is my concern. Doing an ack 'write\(' reveals a scary mix of write, retry_write and fwrite calls. My initial reaction was that binary files seem to use open/retry_write, and text files use fopen/fwrite, but doesn't quite seem to be the case... mailbox.c 1242:r = write(newheader_fd, MAILBOX_HEADER_MAGIC, 1359:n = retry_write(mailbox-index_fd, buf, header_size); 1428:n = retry_write(mailbox-index_fd, buf, INDEX_RECORD_SIZE); 1477:n = retry_write(mailbox-index_fd, buf, len); 1642:fwrite(buf, 1, INDEX_HEADER_SIZE, newindex); 1659:fwrite(bufp, INDEX_RECORD_SIZE, 1, newindex); 1710:fwrite(buf, INDEX_RECORD_SIZE, 1, newindex); 1721: fwrite(buf+OFFSET_DELETED, 1952: n = retry_write(expunge_fd, buf, mailbox-record_size); 1979: if (newindex) fwrite(buf, 1, mailbox-record_size, newindex); 1999: /* fwrite will automatically call write() in a sane way */ 2000: fwrite(cacheitembegin, 1, cache_record_size, newcache); 2004: fwrite(buf, 1, mailbox-record_size, newindex); 2058:fwrite(buf, 1, mailbox-start_offset, newindex); 2215: fwrite(buf, 1, sizeof(bit32), newcache); 2219:fwrite(buf, 1, mailbox-start_offset, newindex); 2263: n = retry_write(expunge_fd, buf, mailbox-start_offset); 2342: r = quota_write(mailbox-quota, tid); 2363: fwrite(buf, 1, mailbox-start_offset, newexpungeindex); 2424: n = retry_write(expunge_fd, buf, mailbox-start_offset); 2719: n = retry_write(mailbox.cache_fd, (char *)mailbox.generation_no, 4); 2823: r = quota_write(mailbox-quota, tid); 3056: r = quota_write((newmailbox-quota), tid); 3309: r = quota_write(newmailbox.quota, tid); 3319: r2 = quota_write(newmailbox.quota, tid); 3398:n = retry_write(destfd, src_base, src_size); It seems mixing up fd's or FILE * structs all over the place. *sigh* Does fwrite() retry a write on EINTR? It looks like that's the whole point of retry_write() anyway. If fwrite() does retry, then about the only other work would be changing any naked write() calls to retry_write(), which actually doesn't seem that many. Thoughts? Rob
Re: Making Replication Robust
c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised. Something more than sync_shutdown_file plus automatic retries on recent work files? I think the problem at the moment is that the process you really want is: 1. Stop new imap/pop/lmtp/sieve/etc connections 2. Finish and close existing connections cleanly but as quickly as possible 3. Finish running any sync log files 4. Fully shutdown There's currently no clean way to do this. Basically you have to SIGTERM master which hard kills it and all children, then manually run sync_client -f on any remaining log files. We've got a patch which makes master handle SIGQUIT much more nicely. Basically it appears there was some existing infrastructure that was designed to handle a cleaner shutdown, look at the code to all the places that call signals_poll(). It looks like the idea was that you could send child processes SIGQUIT and they would continue their current action until their main loop and check if they'd been sent a QUIT, and then exit cleanly. Unfortunately if you sent SIGQUIT to master, it would just SIGTERM all children, not SIGQUIT them. This patch attempts to fix this, so that sending SIGQUIT to master, sends SIGQUIT to all children, and then waits for them to all exit cleanly. http://cyrus.brong.fastmail.fm/#cyrus-clean-shutdown-2.3.8.diff This solves step 1 2 above, though it doesn't deal with the case of a crazy child that doesn't respond to SIGQUIT. Personally our init script sends SIGQUIT, and if the master process is still there after 10 seconds, then it sends SIGTERM to force and exit. In general we find that everything exits after a couple of seconds of SIGQUIT. To do step 3, I think the best might be to have a new cyrus.conf section, a SHUTDOWN section which gives some commands to run on shutdown. Basically after all children have accepted a SIGQUIT and exited, then we run the SHUTDOWN section, which would run a final sync_client -r on the sync dir to finish up any remaining log files. With all of that in place, it means you could send a SIGQUIT to a cyrus master process on a master server, and it would cleanly shutdown all children and ensure that all replication events have been correctly played to the replica. You could then do the same to the replica, then reverse their roles, and bring them both back up and you've got a safe soft failover. At the moment we replace messages (on the master knows best principle). It would be easy enough to leave message in place and generate warnings instead, although this would generate a lot of warnings, one for every bad message every time that a given mailbox is updated. That's what this patch does. http://cyrus.brong.fastmail.fm/#cyrus-warnmismatcheduuids-2.3.8.diff In theory with clean soft failovers, you should NEVER have UIDs with mismatched UUIDs. After a hard failover, you obviously might, but in those cases, just replacing the message means we're almost certainly overwriting a delivered message and loosing it which is bad. At least making it an option to overwrite or log I think is a sane idea. My nightmare scenario is a replication engine which carries on running in the face of mboxlist corruption on the master: you could lose a lot of mailboxes on the replica that way. That would be bad, though hard to detect and stop. I guess that's what backups are for... It would be easy enough to generate multiple replication log files. MySQL keeps a single transaction log for multiple replicas, but that file contains quite a lot of information about each transaction. In contrast the Cyrus sync log is just a list of objects we need to pay attention to: the files have much less state, particularly without duplicates. The other option is rather than using the rotate log, play it, delete it system, you generate one log file but you keep track of offsets within the file to tell you where each replica is up to. That's what mysql does, so you can have multiple replicas because each replica is playing off the same log files, they're just up to different offsets at any point in time. Rob
Re: Two bug fix patches
Eventually worked out that if a message is delivered by LMTP to several recipients, then the first recipient gets a correct ENVELOPE, but all the following recipients get an lcase()ed version. The reason is that append_fromstage() in 2.3 reuses the struct body calculated for the first recipient. Unfortunately there are some lcase() calls in the middle of message_write_cache() which overwrite fields in that structure. The incorrect cyrus.cache entries are replicated as is by sync_client UPLOAD PARSED, so master and replica are typically both incorrect. I'm running a mailstore with fastmail.fm patch for uuidmode: md5. Should I worry? It seems to me I should. I don't think so. This only affects cache records. Our uuidmode patch only calculates the md5 for the message body. The make_md5 David mentioned is a completely separate script he runs and maintains to calculate the md5's of messages and stores the values separately to the existing cyrus structures. At least I believe that's the case, I'm sure David will comment. Rob
Re: *IMPORTANT* - bugfix sync_append_commit index breakage
I believe so. If we also used it for appends, we wouldn't run into the problem that Fastmail is seeing. But, it may be too much of a performance hit. I don't think that any of the code does the .NEW for appends because that makes append O(N) where N is the mailbox size, and that would suck for huge mailboxes. Definitely. Rewriting the cyrus.index file for every message delivered to a mailbox would be horrible, especially when some users have 100,000+ messages in a mailbox. Speaking of which, is there any reason why cyrus.expunge isn't sorted by UID? We have to rewrite the entire cyrus.index each time we do an expunge anyway. I'm guessing it's because messages can be deleted + expunged in any order, so it's easy just to append the record for any message being expunged to the end of the cyrus.expunge file. If you had to keep it sorted, that would mean every time you expunged a message, you'd have to reread, resort, and rewrite the cyrus.expunge file. Generally the cyrus.expunge will be a lot smaller than the cyrus.index, but still annoying... Rob
Re: Need advice: mailbox-based \Seen flag
I actually went digging into seen state and cyrus.index a little today for another reason, and I realized that we wouldn't have to expand/upgrade the index file to accomodate a shared seen state. The existing field for system flags is just a bitmask, so we could easily add a bit for shared \Seen. We could also implement the sharedseen annotation as a mailbox flag in the index header, just like condstore. Based on other changes that I'm thinking about, I'm starting to like this idea better, even thought it *may* involve more code. Doesn't that just seem annoying having two *completely* different ways of storing seen state? There' probably quite a few places in the code that look for the seen state. It just seems much easier to me to have a single flag that controls whether to open a specific users seen db, or the anyone user seen db, rather than having to special case lots of code around the place for either pulling from the seen db, or looking up a bit in the index file. Rob
Re: Various small patches
lmtpd - deliver: $msgid$, from=$from$, to=$to$, uid=$uid$, mailbox=$mboxname$ lmtpd - sieve discard: $msgid$, from=$from$, to=$to$ lmtpd - sieve redirect: $msgid$, from=$from$, to=$to$, redirect=$redirect lmtpd - sieve error: $msgid$, from=$from, to=$to$, error=$error$ imapd - expunge: $msgid$, uid=$uid$, mailbox=$mboxname$ Does something like that sound reasonable? Yes, although I doubt that you want imapd/ipop3d to log the UID of every single message which is expunged. Given we're suggesting logging the msgid, from, to, mailboxname and uid of every message delivered, I don't see too much of a problem in logging the msgid, mailbox name and uid of every message expunged, it should be a pretty similar order of magnitude and would provide very useful tracking information. Rob