Re: cyrus.index version 14 and cyrus.cache upgrades

Bron Gondwana via Cyrus-devel Mon, 04 Apr 2016 03:19:26 -0700

Update: the meeting is happening at 9pm Melbourne time!  I see I set it to
stay at the same UTC time always.


On Sun, Apr 3, 2016, at 22:21, Bron Gondwana via Cyrus-devel wrote:
> (this is a discussion piece for talking about in tomorrow's meeting,
> which IS happening at the regular 10pm Melbourne time - that's now
> ANOTHER hour later for everyone due to timezones changing.  I
> haven't written any code yet)
> 
> On top of Robert's work to support libicu for charset conversion and
> pick up all the rest of the character sets it supports, we need to make
> some cache format changes.
> 
> I also have a user at FastMail with a 3.8 million message "Deleted
> Messages" folder, and I can't keep manually splitting giant folders for
> people just because their cyrus.cache file gets over 4 gig.
> 
> So I'm proposing the following changes:
> 
> cyrus.index version 14:
> 
> cyrus.index header:
> 
> * LAST_APPEND_DATE: 32 bit => 64 bit time_t
> * POP3_LAST_LOGIN: 32 bit => 64 bit time_t
> * LEAKED_CACHE: remove
> * FIRST_EXPUNGED: 32 bit => 64 bit time_t
> * LAST_REPACK_TIME: 32 bit => 64 bit time_t
> * HEADER_FILE_CRC: remove
> * RECENT_TIME: 32 bit => 64 bit time_t
> * POP3_SHOW_AFTER: 32 bit => 64 bit time_t
> * add UNIQUEID: 40 characters (enough space for a uuidgen UUID or
>   whatever)
> * add a bunch of space for un-fixed-width quotaroot and flag names.
> 
> By doing this, we no longer have a separate cyrus.header and
> cyrus.index. We only have ONE file in which facts are stored (except
> cyrus.annotations, but I have plans for that too).
> 
> If the non-fixed data gets too big then we create a new file called
> cyrus.indexoverflow which contains just the non-fixed data.  This is
> another 99%/1% case.  In 99% of cases we won't create enough data (e.g. long
> flag names) to fill the space.  If we fix the header size at 2048 bytes,
> we save in the common case of an almost empty mailbox, while still
> working for huge mailboxes.
> 
> There's a mailbox options flag to say to read from the
> indexoverflow file.
> 
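A rough sketch of the inline-versus-overflow decision described above. Only the 2048-byte header size comes from the proposal; FIXED_PART_SIZE, OPT_USE_OVERFLOW and the function name are hypothetical placeholders, since no code or exact layout exists yet.

```python
HEADER_SIZE = 2048         # header size suggested in the proposal
FIXED_PART_SIZE = 256      # hypothetical: bytes used by the fixed-width fields
OPT_USE_OVERFLOW = 1 << 0  # hypothetical mailbox options bit


def place_variable_data(variable_data: bytes, options: int):
    """Return (inline_bytes, overflow_bytes_or_None, new_options)."""
    room = HEADER_SIZE - FIXED_PART_SIZE
    if len(variable_data) <= room:
        # Common case: quotaroot and flag names fit inside the header.
        inline = variable_data.ljust(room, b"\0")
        return inline, None, options & ~OPT_USE_OVERFLOW
    # Rare case: everything non-fixed moves to cyrus.indexoverflow and
    # the options flag tells readers to look there.
    return b"\0" * room, variable_data, options | OPT_USE_OVERFLOW
```

With these assumed sizes, anything up to 1792 bytes of quotaroot and flag names stays inline and no second file is created.
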
> ACL is no longer stored in this file.  It's not a property of the
> mailbox in any meaningful way - it belongs out in mailboxes.db and the
> next layer up (eventually).
> 
> mailboxname probably will get stored in the mailbox later, when we store
> on disk by uniqueid, but that's another yak to shave.
> 
> 
> cyrus.index record:
> 
> * INTERNALDATE: change 32 bit => 64 bit time_t
> * GMTIME: change 32 bit => 64 bit time_t
> * SENTDATE: remove (moved to cache)
> * HEADER_SIZE: remove (moved to cache)
> * LAST_UPDATED: change 32 bit => 64 bit time_t
> * CONTENT_LINES: remove (moved to cache)
> * CACHE_CRC: remove (moved to cache)
> * CACHE_VERSION: remove (moved to cache)
> * Add: CACHE_FILE_NUMBER (32 bit)
> 
> Basically I want to remove from cyrus.index everything that's derived
> from the message, except GMTIME.  cyrus.index is about remembering FACTS
> about the mailbox which aren't available anywhere else.  It's very
> important data.
> 
> cyrus.cache is all re-creatable from the raw messages.
> 
> The reason to keep gmtime is that it's quite common to SORT by sent
> date, and making that possible without loading cache is a worthwhile
> optimisation.
> 
> ...
> 
> cyrus.cache format changes:
> 
> 1) there's a section in the unstructured data for CACHEACTIVE,
>    which contains a list of (NUM VERSION FLAGS SIZE DIRTYBYTES) -
>    probably binary encoded to save space as b32 b16 b16 b32 b32 =>
>    128 bits per file.
> 
>    e.g. (3 5 0 1894322 1647)
> 
> 2) each cyrus.cache file starts with the NUM VERSION FLAGS triple, and
>    maybe even the SIZE and DIRTYBYTES as well, it wouldn't hurt to
>    update them after appending new records.
> 
> 3) each cyrus.cache record has structure:
>    * CACHE_ITEM_LEN 32 bit
>    * CACHE_VERSION 32 bit
>    * SENTDATE 64 bit time_t
>    * HEADER_SIZE 32 bit
>    * CONTENT_LINES 32 bit
>    * (existing fields with their individual structure)
>    * <pad to multiple of 8 bytes>
>    * CACHE_ITEM_CRC32 32 bit
> 
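To make the "128 bits per file" arithmetic in item 1 concrete, here is a minimal sketch of one CACHEACTIVE entry packed as b32 b16 b16 b32 b32. Big-endian byte order and the names CACHEACTIVE_ENTRY, pack_entry and unpack_entries are assumptions for illustration only.

```python
import struct

# NUM (32), VERSION (16), FLAGS (16), SIZE (32), DIRTYBYTES (32).
# ">" means big-endian with no alignment padding (byte order is assumed).
CACHEACTIVE_ENTRY = struct.Struct(">IHHII")


def pack_entry(num, version, flags, size, dirtybytes):
    """Encode one (NUM VERSION FLAGS SIZE DIRTYBYTES) tuple as 16 bytes."""
    return CACHEACTIVE_ENTRY.pack(num, version, flags, size, dirtybytes)


def unpack_entries(blob):
    """Decode a concatenated list of entries back into tuples."""
    step = CACHEACTIVE_ENTRY.size
    return [CACHEACTIVE_ENTRY.unpack_from(blob, off)
            for off in range(0, len(blob), step)]
```

The example entry (3 5 0 1894322 1647) packs to exactly 16 bytes this way.
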
> 
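The record layout in item 3 could be sketched like this. Several details are assumptions the proposal leaves open: big-endian byte order, CACHE_ITEM_LEN counting the whole record including the CRC, padding applied to everything before the CRC, and the CRC covering everything before it. The function names are hypothetical.

```python
import struct
import zlib

# CACHE_ITEM_LEN (32), CACHE_VERSION (32), SENTDATE (64, signed),
# HEADER_SIZE (32), CONTENT_LINES (32) = 24 bytes.
RECORD_HEAD = struct.Struct(">IIqII")


def pack_record(cache_version, sentdate, header_size, content_lines, fields):
    """Build one cache record; `fields` is the existing field data as bytes."""
    unpadded = RECORD_HEAD.size + len(fields)
    pad = -unpadded % 8                 # pad up to a multiple of 8 bytes
    item_len = unpadded + pad + 4       # assumed to include the CRC
    body = RECORD_HEAD.pack(item_len, cache_version, sentdate,
                            header_size, content_lines)
    body += fields + b"\0" * pad
    return body + struct.pack(">I", zlib.crc32(body))


def check_record(record):
    """Verify the trailing CACHE_ITEM_CRC32 against the rest of the record."""
    body = record[:-4]
    (crc,) = struct.unpack(">I", record[-4:])
    return zlib.crc32(body) == crc
```

Because each record carries its own length, version and CRC, a reader would not need anything from cyrus.index beyond the file number and offset to validate it.
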
> On disk the file names are cyrus.cache.N, e.g. cyrus.cache.3
> 
> New records are always added to the FIRST active cache file that matches
> the criteria of the record, i.e. if it's ARCHIVED then the first cache
> file with the ARCHIVE bit set.
> 
> If a cache file gets too big (compile time option, probably 100
> megabytes or so) then a new file with the next unused number gets
> created and added to the start of the list.
> 
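A sketch of the selection and rollover rules above, using an in-memory list in place of the CACHEACTIVE section. FLAG_ARCHIVE, the CacheFile class, pick_cache_file and the "highest number plus one" rule for the next unused number are all assumptions; only the 100 megabyte figure comes from the proposal.

```python
from dataclasses import dataclass

FLAG_ARCHIVE = 1 << 0                # hypothetical bit value
MAX_CACHE_SIZE = 100 * 1024 * 1024   # "probably 100 megabytes or so"


@dataclass
class CacheFile:
    num: int
    version: int
    flags: int
    size: int = 0
    dirtybytes: int = 0


def pick_cache_file(active, flags, version, max_size=MAX_CACHE_SIZE):
    """Return the cache file a new record with these flags should go to."""
    for cf in active:
        if cf.flags == flags:
            if cf.size < max_size:
                return cf            # first matching file still has room
            break                    # first match is full: roll over
    # Assumed rule: next unused number is one past the highest in the list.
    next_num = max((cf.num for cf in active), default=0) + 1
    new = CacheFile(num=next_num, version=version, flags=flags)
    active.insert(0, new)            # new files go to the START of the list
    return new
```

Putting the new file at the start of the list means the "first match" scan finds it immediately on the next append.
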
> During cyr_expire, if a cache file is more than a configured amount
> "dirty" then the records get copied to a newer file and their associated
> index records updated to the new locations.  Once it's unreferenced, it
> can be safely deleted.
> 
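The cyr_expire step might look roughly like this, with bytearrays standing in for the cache files and dicts for index records. The record shape (file, offset, length), the function names and the threshold parameter are hypothetical; the real code would work through the mailbox API under the usual locks.

```python
def is_too_dirty(size, dirtybytes, max_dirty_fraction):
    """True when more than the configured fraction of the file is dead data."""
    return size > 0 and dirtybytes / size > max_dirty_fraction


def migrate(records, files, old_num, new_num):
    """Copy live records out of files[old_num] into files[new_num].

    records: list of dicts with "file", "offset" and "length" keys
    files:   dict mapping cache file number to a bytearray
    """
    old, new = files[old_num], files[new_num]
    for rec in records:
        if rec["file"] != old_num:
            continue
        data = old[rec["offset"]:rec["offset"] + rec["length"]]
        rec["offset"] = len(new)     # record now lives at the end of new
        rec["file"] = new_num
        new.extend(data)
    # Nothing references the old file any more, so it can be removed.
    del files[old_num]
```

Only records still referenced from the index are copied, so the dirty bytes are dropped along with the old file.
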
> During a normal repack, if most records are being kept, then the
> cyrus.cache files will be untouched, saving on IO.
> 
> .....
> 
> This is all backwards compatible.  Earlier cyrus.index versions will
> write just a single cache file.  The upgrade and downgrade facilities
> will still work, and convert just fine.  All the existing reading code
> will stay.
> 
> I'll convert Robert's cache format change code to also be able to write
> the old style (or "unknown" if the charset isn't one of the ones with a
> numeric code) values for old cache files.
> 
> Woohoo.  No more 64 bit nastiness, reduced cache IO in the common case,
> and a savings of 4096 bytes (one file) per mailbox from the super-hot
> index location in the common case.
> 
> Bron.
> 
> 
> -- 
>   Bron Gondwana
>   br...@fastmail.fm


-- 
  Bron Gondwana
  br...@fastmail.fm
