On Fri, Jun 10, 2016, at 10:15, Bron Gondwana via Cyrus-devel wrote:
> On Fri, Jun 10, 2016, at 06:40, Thomas Jarosch via Cyrus-devel wrote:
> > Hi Bron,
> > 
> > Am 08.06.2016 um 08:22 schrieb Bron Gondwana via Cyrus-devel:
> > > *THE PLAN[tm]***
> > >  
> > > For JMAP support, I'm going to discard the existing conversations DB and
> > > create a sqlite database per user which contains everything of value. 
> > 
> > one thing to watch out for with sqlite:
> > 
> > It doesn't scale easily with multiple processes accessing the same DB.
> > The write-lock timeout is short by default and a "modifying"
> > query might error out.
>
> Yeah, I know - which is why I've been locking around it with an exclusive
> file lock so only one process can hit it at a time.
> 
> You'd think that would ruin performance, but I haven't actually had too
> much trouble.  The conversations DB is already a per-user exclusive lock
> whenever you've got any mailbox open right now.

The more I think about this, the more I'm worried that it's a half-arsed 
solution.

I already knew it was a stopgap on the way to a fully stateless server.  Being
able to synchronously "back up" to another server means we need to cheaply sync
the state to some central server.

Which basically means log shipping.  You can do that pretty quickly with the
skiplist/twoskip format by just saving the end of the file each time, and having
the "restore from hard crash" process be a recovery2 on the file - walking the
file and applying every change while ignoring the pointers entirely.
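
Roughly, the shipping side is just "send everything past the last offset we
synced" - a minimal sketch only, not Cyrus code; send_to_backup() is a made-up
hook for whatever the transport ends up being:

/* Sketch of tail-shipping an append-mostly database file: send only
 * the bytes written since the last sync point.  The receiving end
 * replays records and ignores in-place pointer rewrites, recovery2
 * style.  send_to_backup() is hypothetical. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

extern int send_to_backup(const char *buf, size_t len, off_t at_offset);

int ship_tail(const char *fname, off_t *last_offset)
{
    struct stat sbuf;
    int fd = open(fname, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &sbuf) < 0) { close(fd); return -1; }

    off_t newbytes = sbuf.st_size - *last_offset;
    if (newbytes <= 0) { close(fd); return 0; }      /* nothing new to ship */

    char *buf = malloc(newbytes);
    if (!buf) { close(fd); return -1; }

    int r = -1;
    if (pread(fd, buf, newbytes, *last_offset) == newbytes)
        r = send_to_backup(buf, newbytes, *last_offset);
    if (!r) *last_offset = sbuf.st_size;             /* advance the sync point */

    free(buf);
    close(fd);
    return r;
}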

But mixing that in with sqlite3 is tricky, and it's even trickier if you want to
change to another backend.

Sqlite "INTEGER" types also cost at least 8 bytes each, so you're already 
spending
a lot of space or you're still packing bitfields to store flag information.
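
The packed form is trivial anyway - something like this (bit values are
illustrative only, not the actual cyrus.index layout):

/* Sketch of packing message flags into one 32-bit word instead of
 * one integer column per flag.  Bit assignments here are made up. */
#include <stdint.h>

#define FLAG_ANSWERED  (1u << 0)
#define FLAG_FLAGGED   (1u << 1)
#define FLAG_DELETED   (1u << 2)
#define FLAG_DRAFT     (1u << 3)
#define FLAG_SEEN      (1u << 4)

static inline void set_flag(uint32_t *flags, uint32_t f)   { *flags |= f; }
static inline void clear_flag(uint32_t *flags, uint32_t f) { *flags &= ~f; }
static inline int  has_flag(uint32_t flags, uint32_t f)    { return (flags & f) != 0; }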

So I think I'm going to throw away everything I've done so far, and go back to
basics:

* 1 database file per user (or per top-level shared folder name for non-user
  folders)
* 1 mailboxes database file for the server
* 1 temporary data file for the server (aka: delivery.db, tls_cache, etc) -
  these don't need to be durable

* optional: writeback to object storage for EVERYTHING on commit, so that you
  never lose data in any server crash situation

Let's break this down a little bit:

1 database file per user:

- actually this is probably a couple of files, because there are at least
  three very distinct classes of data:
* cache data
* emails
* index data (including annotations)
* multiple cache files per user - probably not even per-mailbox, but just for
  the entire user, with a repack strategy which keeps things in the order
  they're likely to be requested by clients.

It would be really nice to get indexes for free, but with a key-value format
that allows prefix scans (cyrusdb_foreach) you can implement indexes very
easily.  Sure, it's more work than just writing SQL, but with transactions it's
just as reliable if the code is good.  We'll be reconstructing those files in
audit mode often enough to be sure of that, I hope :)
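
To sketch the prefix-scan index idea (key layout made up for illustration, and
kv_store()/kv_foreach_prefix() just standing in for cyrusdb_store() and
cyrusdb_foreach()):

/* Sketch of a "secondary index" on a prefix-scannable key-value store:
 * the index is just another set of keys whose prefix is the indexed
 * value, so a lookup becomes a prefix scan. */
#include <stdio.h>
#include <string.h>

/* hypothetical KV primitives standing in for the cyrusdb interface */
extern int kv_store(const char *key, const char *val);
extern int kv_foreach_prefix(const char *prefix,
                             int (*cb)(const char *key, const char *val, void *rock),
                             void *rock);

/* Primary record:  "R" uid            -> record data
 * Index record:    "I" msgid "\t" uid -> ""                        */
int store_message(const char *uid, const char *msgid, const char *data)
{
    char key[1024];

    snprintf(key, sizeof(key), "R%s", uid);
    if (kv_store(key, data)) return -1;

    snprintf(key, sizeof(key), "I%s\t%s", msgid, uid);
    return kv_store(key, "");
}

static int collect_uid(const char *key, const char *val, void *rock)
{
    const char *uid = strchr(key, '\t');
    if (uid) printf("matched uid: %s\n", uid + 1);
    (void)val; (void)rock;
    return 0;
}

int lookup_by_msgid(const char *msgid)
{
    char prefix[1024];
    snprintf(prefix, sizeof(prefix), "I%s\t", msgid);
    return kv_foreach_prefix(prefix, collect_uid, NULL);
}

Both writes go inside the one transaction, which is where the "just as
reliable" bit comes from.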

...

If twoskip is too slow (possible), then I've been quite interested in looking 
at rocksdb (http://rocksdb.org/) as an embedded engine that has really good 
performance, prefix scanning, and a good community around it.  It's also quite 
compatible with object storage because all but the level0 "hot" databases are 
read-only, so you can store them as objects once and then not need to scan them 
again.
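
For shape, a prefix scan through the RocksDB C API looks roughly like this
(path and keys are made up, nothing Cyrus-specific):

/* Sketch of open + put + prefix scan with the RocksDB C API
 * (rocksdb/c.h). */
#include <stdio.h>
#include <string.h>
#include <rocksdb/c.h>

int main(void)
{
    char *err = NULL;
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);

    rocksdb_t *db = rocksdb_open(opts, "/tmp/userdb-example", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_put(db, wopts, "I1234\tA", 7, "", 0, &err);
    if (err) { fprintf(stderr, "put: %s\n", err); return 1; }

    /* prefix scan: seek to the prefix, walk while keys still match it */
    const char *prefix = "I1234\t";
    size_t plen = strlen(prefix);
    rocksdb_readoptions_t *ropts = rocksdb_readoptions_create();
    rocksdb_iterator_t *it = rocksdb_create_iterator(db, ropts);
    for (rocksdb_iter_seek(it, prefix, plen);
         rocksdb_iter_valid(it);
         rocksdb_iter_next(it)) {
        size_t klen;
        const char *key = rocksdb_iter_key(it, &klen);
        if (klen < plen || memcmp(key, prefix, plen)) break;
        printf("hit: %.*s\n", (int)klen, key);
    }
    rocksdb_iter_destroy(it);

    rocksdb_readoptions_destroy(ropts);
    rocksdb_writeoptions_destroy(wopts);
    rocksdb_close(db);
    rocksdb_options_destroy(opts);
    return 0;
}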

An alternative there is multi-level databases, in the same way we have the
search tiers - with offline repack, and atomically substituting a new database
with identical contents (minus dead records) the way we do it with search.
This eliminates the stop-the-world repacks that occasionally hit us with both
cyrus.index/cyrus.cache and all the twoskip/skiplist databases, because repack
can be done in the background to new read-only files, with all writes happening
to a small level0 database.
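
The substitution itself is just the usual write-new-file-then-rename dance; a
sketch only (write_compacted_copy() is hypothetical):

/* Sketch of "repack in the background, substitute atomically": build
 * the compacted copy under a temp name, fsync it, then rename(2) it
 * over the live name so readers only ever see a complete file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

extern int write_compacted_copy(int fd);   /* hypothetical: copy live records, skip dead ones */

int repack_atomically(const char *live, const char *tmp)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    if (write_compacted_copy(fd) || fsync(fd)) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* atomic: readers see either the old file or the new one, never a mix */
    return rename(tmp, live);
}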

We already kinda do this multi-level trick in-memory for cyrus.index now, with
a hash that gets looked up for every read.

And that's about where my thinking is :)  It's more work now, but it gets us to 
a fully object-storage-backable system a lot faster.  We could then have 
replication mainly be used to trigger a pull from object storage and heating of 
the same files so that failover was clean.

...

I still want a global mailboxes state database, which would be a distributed
database rather than the current murder arrangement.  This is in ADDITION to
the per-machine mailboxes.db, and would be read-only - along with a locking
service which pins each user/top-level-shared to a single machine in the
cluster, and a way to transfer individual locks or bulk blocks of locks between
machines for failover.  Something like etcd/consul seems the right choice here.
This is definitely phase2, I'm just keeping it in mind as I design this change.
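
For a rough idea of the pinning half, an etcd v2 style "create only if nobody
holds it yet, with a TTL" would do it - sketch only, hosts/paths/names all
made up:

/* Sketch of pinning a user to one machine via etcd's v2 API using
 * libcurl: create the key only if it doesn't already exist, with a
 * TTL so a dead machine's pin expires.  Everything here is
 * illustrative. */
#include <stdio.h>
#include <curl/curl.h>

int pin_user(const char *user, const char *machine)
{
    char url[512], body[256];
    snprintf(url, sizeof(url),
             "http://127.0.0.1:2379/v2/keys/pins/%s?prevExist=false", user);
    snprintf(body, sizeof(body), "value=%s&ttl=30", machine);

    CURL *curl = curl_easy_init();
    if (!curl) return -1;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);
    long status = 0;
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
    curl_easy_cleanup(curl);

    /* 2xx = we hold the pin; 412 = someone else already does */
    return (rc == CURLE_OK && status / 100 == 2) ? 0 : -1;
}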

It is a massive change to the on-disk data formats!  We'd be left with 
basically:

* key value stores
* cache format (multiple fixed-length binary items per file with file number + 
offset addressing)
* rfc822 messages (either stick with one-file-per-message or do some MIX style 
multiple-per-file - this can be independent)

By making every database a key-value store (including the DAV databases - I 
would subsume them into the userdb) there's only the two data formats to even 
care about backing up - and there are tons of distributed key-value stores that 
could already be plugged in directly through the cyrusdb interface if you 
wanted to!

Bron.

-- 
  Bron Gondwana
  br...@fastmail.fm
