(this is a discussion piece for talking about in tomorrow's meeting, which IS happening at the regular 10pm Melbourne time - that's now ANOTHER hour later for everyone due to timezones changing. I haven't written any code yet)
On top of Robert's work to support libicu for charset conversion and pick up all the rest of the character sets it supports, we need to make some cache format changes.

I also have a user at FastMail with a 3.8 million message "Deleted Messages" folder, and I can't keep manually splitting giant folders for people just because their cyrus.cache file gets over 4 gig.

So I'm proposing the following changes:

cyrus.index version 14:

cyrus.index header:

* LAST_APPEND_DATE: 32 bit => 64 bit time_t
* POP3_LAST_LOGIN: 32 bit => 64 bit time_t
* LEAKED_CACHE: remove
* FIRST_EXPUNGED: 32 bit => 64 bit time_t
* LAST_REPACK_TIME: 32 bit => 64 bit time_t
* HEADER_FILE_CRC: remove
* RECENT_TIME: 32 bit => 64 bit time_t
* POP3_SHOW_AFTER: 32 bit => 64 bit time_t
* add UNIQUEID: 40 characters (enough space for a uuidgen UUID or whatever)
* add a bunch of space for un-fixed-width quotaroot and flag names.

By doing this, we no longer have a separate cyrus.header and cyrus.index. We only have ONE file in which facts are stored (except cyrus.annotations, but I have plans for that too).

If the non-fixed data gets too big then we create a new file called cyrus.indexoverflow which contains just the non-fixed data. This is another 99%/1% case. In 99% of cases we won't create enough non-fixed data (e.g. long flag names) to fill the space. If we fix the header size at 2048 bytes, we save space in the common case of an almost empty mailbox, while still working for huge mailboxes. There's a mailbox options flag to say to read from the indexoverflow file.

ACL is no longer stored in this file. It's not a property of the mailbox in any meaningful way - it belongs out in mailboxes.db and the next layer up (eventually). mailboxname probably will get stored in the mailbox later, when we store on disk by uniqueid, but that's another yak to shave.
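To make the fixed-header-plus-overflow idea above concrete, here is a rough sketch in Python (chosen for brevity, not because Cyrus would do it this way). Only the 2048 byte header size and the cyrus.indexoverflow concept come from the proposal. The size of the fixed-width fields, the options bit value, the NUL-terminated encoding and all function names are assumptions made up for illustration.

```python
HEADER_SIZE = 2048          # fixed header size floated in the proposal
FIXED_FIELDS_SIZE = 256     # ASSUMPTION: bytes used by the fixed-width fields
OPT_INDEX_OVERFLOW = 0x01   # ASSUMPTION: hypothetical mailbox options bit


def encode_variable(quotaroot, flag_names):
    """Encode quotaroot and flag names as NUL-terminated UTF-8 strings.

    The encoding is hypothetical; the proposal does not define one.
    """
    parts = [quotaroot.encode("utf-8")]
    parts += [name.encode("utf-8") for name in flag_names]
    return b"".join(part + b"\0" for part in parts)


def place_variable_data(quotaroot, flag_names):
    """Return (options_bits, inline_bytes, overflow_bytes).

    If the variable-width data fits in the space left in the header it is
    stored inline (zero padded) and nothing goes to the overflow file.
    Otherwise the inline area is left empty, the overflow bit is set, and
    all of the variable-width data goes to cyrus.indexoverflow.
    """
    blob = encode_variable(quotaroot, flag_names)
    room = HEADER_SIZE - FIXED_FIELDS_SIZE
    if len(blob) <= room:
        return 0, blob.ljust(room, b"\0"), b""
    return OPT_INDEX_OVERFLOW, b"\0" * room, blob
```

With these made-up numbers a mailbox with a handful of short flag names stays entirely inside the 2048 byte header, and only a mailbox with a large amount of variable-width data ever causes a second file to be created.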

cyrus.index record:

* INTERNALDATE: change 32 bit => 64 bit time_t
* GMTIME: change 32 bit => 64 bit time_t
* SENTDATE: remove (moved to cache)
* HEADER_SIZE: remove (moved to cache)
* LAST_UPDATED: change 32 bit => 64 bit time_t
* CONTENT_LINES: remove (moved to cache)
* CACHE_CRC: remove (moved to cache)
* CACHE_VERSION: remove (moved to cache)
* Add: CACHE_FILE_NUMBER (32 bit)

Basically I want to remove everything except GMTIME that's derived from the message out of cyrus.index. cyrus.index is about remembering FACTS about the mailbox which aren't available anywhere else. It's very important data. cyrus.cache is all re-creatable from the raw messages. The reason to keep gmtime is that it's quite common to SORT by sent date, and making that possible without loading cache is a worthwhile optimisation.

...

cyrus.cache format changes:

1) there's a section in the unstructured data for CACHEACTIVE, which contains a list of (NUM VERSION FLAGS SIZE DIRTYBYTES) - probably binary encoded to save space as b32 b16 b16 b32 b32 => 128 bits per file. e.g. (3 5 0 1894322 1647)

2) each cyrus.cache file starts with the NUM VERSION FLAGS triple, and maybe even the SIZE and DIRTYBYTES as well; it wouldn't hurt to update them after appending new records.

3) each cyrus.cache record has structure:

   * CACHE_ITEM_LEN 32 bit
   * CACHE_VERSION 32 bit
   * SENTDATE 64 bit time_t
   * HEADER_SIZE 32 bit
   * CONTENT_LINES 32 bit
   * (existing fields with their individual structure)
   * <pad to multiple of 8 bytes>
   * CACHE_ITEM_CRC32 32 bit

On disk the file names are cyrus.cache.N, e.g. cyrus.cache.3

New records are always added to the FIRST active cache file that matches the criteria of the record, i.e. if it's ARCHIVED then the first cache file with the ARCHIVE bit set. If a cache file gets too big (compile time option, probably 100 megabytes or so) then a new file with the next unused number gets created and added to the start of the list.
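
As a sanity check on the field widths above, here is a small Python sketch that packs one CACHEACTIVE entry and one cache record. The field order and widths are taken from the lists above. Everything else is an assumption: big-endian byte order, CACHE_ITEM_LEN counting the whole record including the CRC, padding applied before the CRC, and the CRC covering everything that precedes it. The function names are hypothetical.

```python
import struct
import zlib

# ASSUMPTION: big-endian (network order); the proposal only gives bit widths.
# NUM(32) VERSION(16) FLAGS(16) SIZE(32) DIRTYBYTES(32) = 128 bits
CACHEACTIVE_ENTRY = struct.Struct(">IHHII")
# CACHE_ITEM_LEN(32) CACHE_VERSION(32) SENTDATE(64) HEADER_SIZE(32) CONTENT_LINES(32)
RECORD_PREFIX = struct.Struct(">IIqII")


def pack_cacheactive(num, version, flags, size, dirtybytes):
    """Pack one (NUM VERSION FLAGS SIZE DIRTYBYTES) entry into 16 bytes."""
    return CACHEACTIVE_ENTRY.pack(num, version, flags, size, dirtybytes)


def pack_cache_record(version, sentdate, header_size, content_lines, fields):
    """Pack one cache record; `fields` is the existing per-field data.

    ASSUMPTIONS: the data before the CRC is zero padded to a multiple of
    8 bytes, CACHE_ITEM_LEN is the total record length including the CRC,
    and the CRC32 is computed over every byte that precedes it.
    """
    unpadded = RECORD_PREFIX.size + len(fields)
    padded = (unpadded + 7) & ~7          # round up to a multiple of 8
    total = padded + 4                    # plus the trailing CRC32
    body = RECORD_PREFIX.pack(total, version, sentdate,
                              header_size, content_lines)
    body += fields + b"\0" * (padded - unpadded)
    return body + struct.pack(">I", zlib.crc32(body))


entry = pack_cacheactive(3, 5, 0, 1894322, 1647)   # the example from item 1
record = pack_cache_record(5, 2**33, 1200, 40, b"abc")
```

Under these assumptions the fixed part of each record is 24 bytes, and a SENTDATE beyond the 32 bit range (2**33 here) packs without any special handling.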

During cyr_expire, if a cache file is more than a configured amount "dirty" then the records get copied to a newer file and their associated index records updated to the new locations. Once it's unreferenced, it can be safely deleted.

During a normal repack, if most records are being kept, then the cyrus.cache files will be untouched, saving on IO.

.....

This is all backwards compatible. Earlier cyrus.index versions will write just a single cache file. The upgrade and downgrade facilities will still work, and convert just fine. All the existing reading code will stay. I'll convert Robert's cache format change code to also be able to write the old style (or "unknown" if the charset isn't one of the ones with a numeric code) values for old cache files.

Woohoo. No more 64 bit nastiness, reduced cache IO in the common case, and a savings of 4096 bytes (one file) per mailbox from the super-hot index location in the common case.

Bron.

--
Bron Gondwana
br...@fastmail.fm