Re: (ITS#8475) Feature request: MDB low durability transactions

h . b . furuseth Fri, 16 Sep 2016 22:49:40 -0700

I'll use this ITS to summarize details for "volatile commits".
Hopefully I've managed to keep it all straight.


"Volatile" vs. "durable" are the most accurate names I can think of.
Not sure if that's more instructive than simply "soft" / "hard".

Description:

* Volatile commits omit fdatasync() without losing consistency.  To
  survive, they must be checkpointed *before* all processes close the
  env.  Un-checkpointed volatiles are lost when the env closes.

  Thus a separate checkpointing daemon can keep the env open to
  protect volatiles from application crash, at least if Robust locks
  are supported.  (The lmdb.h doc seems a bit unclear about Robust.)

* Checkpointing == committing a durable (non-volatile) write-txn.
  (If there is nothing to do, Commit writes nothing.)

  mdb_env_sync() will not checkpoint volatiles, since existing
  programs do not expect it to wait for the write mutex.  It
  "checkpoints" MDB_NOMETASYNC/MDB_NOSYNC.  Maybe mdb_checkpoint()
  will have a special case which obsoletes mdb_env_sync().

* Volatiles are unsupported with MDB_NOLOCK and pointless with
  MDB_NOSYNC.  OTOH it makes sense to enable MDB_NOMETASYNC.

* Volatiles need a bigger datafile, because it takes two durable
  commits to make a freed page reusable. (Plus awaiting old readers).

Configuration.  Too many options, ouch:

  LMDB can be configured to auto-checkpoint after X volatile commits
  and/or Y written kbytes(pages?).  Programs can also checkpoint every
  Z minutes(seconds?) - configured in LMDB to mimic Berkeley DB's
  "checkpoint <kbytes> <minutes>", but regular LMDB ops ignore that.

  The lockfile gives the current config.  An MDB_env could override
  for its particular process, e.g. with an MDB_NO_VOLATILE flag.
  Maybe resetting the lockfile should keep the previous config?
  OTOH I suppose MDB_meta can have default params the way it has
  mm_mapsize.  That survives a backup/restore.

Implementation - plain version first:

* Bump MDB_LOCK_FORMAT, MDB_DATA_VERSION (or make MDB_DATA_FORMAT).

* Keep 2 'MDB_meta's in the lockfile, for volatile commits.
  MDB_env.me_metas[] gets 4 elements: durable + volatile.
  
* mdb_env_open() throws away volatiles if it re-inits the lockfile.

* Add field MDB_meta.mm_oldest: 1 + (previous durable meta).mm_txnid
  in durable metas, and (previous meta).mm_oldest in volatile metas.

  Init 'oldest' in mdb_find_oldest() to new field MDB_env.me_oldest,
  which mdb_txn_renew0(write txn) sets to MDB_meta.mm_oldest.

  When there are no volatiles, this ends up initing 'oldest' = same value
  as today.  Usually we could just have used 1 + (oldest durable
  meta).mm_txnid, but a failed write_meta() may have clobbered that.
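
  A minimal sketch of the mm_oldest rule, under my reading that
  "previous durable meta" means the newest durable meta older than
  the one being written.  struct sketch_meta and next_oldest() are
  made-up names for illustration, not LMDB code:

  ```c
  #include <stddef.h>

  typedef size_t txnid_t;           /* stand-in for LMDB's txnid_t */

  /* Hypothetical cut-down meta: only the fields this rule needs. */
  struct sketch_meta {
      txnid_t mm_txnid;             /* txnid of this snapshot */
      txnid_t mm_oldest;            /* the proposed new field */
  };

  /* Value to store in mm_oldest of the meta being written.
   * prev:         the meta that was current when the write txn began
   * prev_durable: the newest durable meta older than the new one
   * is_volatile:  nonzero if the new meta is a volatile commit */
  static txnid_t next_oldest(const struct sketch_meta *prev,
                             const struct sketch_meta *prev_durable,
                             int is_volatile)
  {
      if (is_volatile)
          return prev->mm_oldest;          /* volatile: inherit unchanged */
      return prev_durable->mm_txnid + 1;   /* durable: advance */
  }
  ```

  With no volatiles, prev and prev_durable are the same meta, so a
  durable meta's mm_oldest equals its own txnid, and the next write
  txn starts 'oldest' at its own txnid - 1.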

* Replace MDB_txninfo.mti_txnid with mti_metaref = txnid*16 + P*4 + M:
    M = index to MDB_env.me_metas[],
    P = previous M during this session, initialized to the same as M,
  so we can get this info atomically.

  P may prove unnecessary, but it's simplest to just include it for
  now.  It helps when meta M fails a checksum so we want an older
  meta, in mdb_mutex_failed(), and maybe to see if there are
  volatiles yet.
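
  The packing could look roughly like this.  The helper names are
  invented here, and a 64-bit metaref is assumed:

  ```c
  #include <assert.h>
  #include <stdint.h>

  /* Pack txnid, P and M as txnid*16 + P*4 + M. */
  static uint64_t metaref_pack(uint64_t txnid, unsigned p, unsigned m)
  {
      assert(p < 4 && m < 4);       /* me_metas[] has 4 elements */
      return txnid * 16 + p * 4 + m;
  }

  /* Unpack the three fields again. */
  static uint64_t metaref_txnid(uint64_t ref) { return ref / 16; }
  static unsigned metaref_prev(uint64_t ref)  { return (unsigned)(ref / 4 % 4); }
  static unsigned metaref_meta(uint64_t ref)  { return (unsigned)(ref % 4); }
  ```

  One load of the metaref then yields txnid, P and M together.  The
  price is that the top 4 bits of the txnid no longer fit.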

* Never use mdb_env_pick_meta() when the current metapage is known:
  Use it in mdb_env_open(), in mdb_mutex_failed(), and if MDB_NOLOCK.
  Or rather, I guess it gets a "force" param for those cases.

* Add config in the lockfile. Maybe per-env config overriding it and
  defaults in the datafile. Txn flags "prefer volatile", "checkpoint".
  
* Track the number of pages and volatiles since last durable commit.
  write_meta() compares with the config limits and makes the final
  decision of whether the new meta will be volatile.

  Add MDB_pgstate.mf_pgcount with #pages used so far.  The rest goes
  in a lockfile array[4] indexed by mti_metaref % 4, or in MDB_meta.
  That way, switching to next snapshot stays atomic - just update
  mti_metaref.
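
  A sketch of the decision write_meta() would make.  The struct and
  field names are hypothetical, and "0 means no limit" is my
  assumption:

  ```c
  #include <stddef.h>

  /* Hypothetical config limits; 0 means "no limit". */
  struct ckpt_config {
      size_t max_volatiles;   /* X: volatile commits between checkpoints */
      size_t max_pages;       /* Y: pages written between checkpoints */
  };

  /* Hypothetical counters since the last durable commit. */
  struct ckpt_state {
      size_t volatiles;
      size_t pages;
  };

  /* Return 1 if this commit may be volatile, 0 if it must be durable.
   * want_volatile mirrors the "prefer volatile" txn flag;
   * new_pages is the number of pages this commit writes. */
  static int commit_is_volatile(const struct ckpt_config *cfg,
                                const struct ckpt_state *st,
                                int want_volatile, size_t new_pages)
  {
      if (!want_volatile)
          return 0;
      if (cfg->max_volatiles && st->volatiles + 1 > cfg->max_volatiles)
          return 0;
      if (cfg->max_pages && st->pages + new_pages > cfg->max_pages)
          return 0;
      return 1;
  }
  ```

  With max_volatiles = 2, two commits in a row can be volatile and
  the third is forced durable.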

* txn_begin must verify metas, since we have no fdatasync barriers.
  Re-read and compare, or checksum.

  write_meta() and mutex_failed(): memory barrier between making
  a volatile meta and updating mti_metaref.  Most modern compilers
  have that.  Maybe a fallback implementation is lock;unlock an
  otherwise unused mutex.  Should also include CACHEFLUSH().
  
  It may make sense to have more than 2 volatile metas, so read-only
  txns will have more time to read a meta before it gets overwritten.

  MDB_WRITEMAP (and MDB_VL32?) has non-atomic issues we should deal
  with anyway.
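
  To show where the barrier goes, here is a single-process stand-in
  using C11 atomics.  The names and the tiny meta struct are made up.
  The real thing lives in shared memory, and the plain struct copy in
  the reader can still race with a writer recycling the slot, which
  is why the reader verifies what it got:

  ```c
  #include <stdatomic.h>
  #include <stdint.h>

  struct sketch_meta { uint64_t txnid; uint64_t root; };

  static struct sketch_meta metas[4];     /* stand-in for me_metas[] */
  static _Atomic uint64_t mti_metaref;    /* txnid*16 + P*4 + M */

  /* Writer: fill in slot m, then publish it.  The release store keeps
   * the meta writes from being reordered after the metaref update. */
  static void publish_meta(unsigned m, unsigned p,
                           uint64_t txnid, uint64_t root)
  {
      metas[m].txnid = txnid;
      metas[m].root = root;
      atomic_store_explicit(&mti_metaref, txnid * 16 + p * 4 + m,
                            memory_order_release);
  }

  /* Reader: the acquire load pairs with the release store above.
   * Returns 1 if the copied meta carries the txnid the metaref
   * promised, 0 if the caller should retry. */
  static int read_meta(struct sketch_meta *out)
  {
      uint64_t ref = atomic_load_explicit(&mti_metaref,
                                          memory_order_acquire);
      *out = metas[ref % 4];
      return out->txnid == ref / 16;
  }
  ```

  The txnid compare only catches a recycled slot, not a torn copy,
  so a checksum over the whole meta would be the stronger check.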

* We can have (durable metapage).mp_pgno == (txnid & 1) as before:
  mdb_txn_renew0() steps txnid by 2 instead of 1 if 'meta' is volatile.
  But note that the txnid doesn't say if the snapshot is durable.
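
  The stepping rule in code form (helper names are mine):

  ```c
  #include <stdint.h>

  /* txnid for the next write txn, given the current meta. */
  static uint64_t next_txnid(uint64_t cur_txnid, int cur_is_volatile)
  {
      return cur_txnid + (cur_is_volatile ? 2 : 1);
  }

  /* Datafile meta page that a durable commit with this txnid uses. */
  static unsigned durable_page(uint64_t txnid)
  {
      return (unsigned)(txnid & 1);
  }
  ```

  After durable txn 10 (page 0), volatile commits get txnids 11 and
  13, and a following durable commit gets 15 (page 1).  Every txnid
  after a durable commit has the opposite parity until the next
  durable commit, so that commit never overwrites the newest durable
  meta.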
  
* "Try to checkpoint" feature, which does not await the write mutex:

  Trylock the write mutex in mdb_txn_begin().  If it fails, set a
  lockfile flag "Please checkpoint" and return.  Hopefully someone
  will obey and clear the flag.  mdb_txn_commit(writer) does.

Variants:

* Put volatile MDB_metas in the datafile, behind the usual MDB_metas.

  That protects the metas from malicious/broken processes with
  read-only envs.  Otherwise, using the lockfile (or non-file shared
  memory) extends read-only envs' ability to hack/break the DB.
  
  Be careful to not read volatile MDB_metas that are older than last
  lockfile-reset, since the reset did not clear them.  That also means
  this variant does not enable volatiles with MDB_NOLOCK.

* Stay with MDB_DATA_VERSION = 1, no change in datafile format:

  - Volatiles share next durable txn's txnid (for freeDB keys), but
    put last durable txn's txnid in mti_metaref (for mdb_find_oldest).

  - mdb_txn_id() / MDB_envinfo.me_last_txnid can no longer be used to
    distinguish txns.

    Apps doing that could set an env flag "ignore volatiles", or txn
    flag "fail if current snapshot is volatile".

  - Define V = 2-bit sequence number incremented by commit(writer).
    Include V in MDB_txninfo.mti_metaref.  In volatile metas, include
    mm_metaref = copy of mti_metaref.

    This lets mutex_failed() figure out which MDB_meta is most recent:
    abs(V in mm_metaref - V in mti_metaref) is <= 1 whether mm_metaref
    or mti_metaref changes first in the thread.
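
    Since V wraps at 4, the abs() above has to be taken modulo 4.  A
    sketch with hypothetical helpers that take the two V values
    directly, since where V sits inside mti_metaref is not decided:

    ```c
    /* Signed step count from V value a to V value b, modulo 4.
     * Result is -1, 0, 1 or 2. */
    static int v_delta(unsigned a, unsigned b)
    {
        unsigned d = (b - a) & 3;
        return d == 3 ? -1 : (int)d;
    }

    /* Nonzero if the meta's V is at most one step from the lockfile's V. */
    static int v_is_adjacent(unsigned v_meta, unsigned v_lock)
    {
        int d = v_delta(v_lock, v_meta);
        return d >= -1 && d <= 1;
    }
    ```

    A delta of 1 means the meta was written but mti_metaref not yet
    updated, -1 means the meta is one commit behind, and 2 cannot be
    ordered at all.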

* Support full 32-bit txnids on 32-bit hosts.

  mti_metaref eats some txnid bits in order to stay atomic, but we can
  get them back:

  On 32-bit hosts, make mti_metaref a 64-bit value - something like 
  ((txnid << 32) | R) where R = 32-bit metaref value described before.

  When reading mti_metaref, R is authoritative for the low txnid bits.
  If (txnid << 32) part does not match, adjust it so it does: That'll
  be + or - a small value.  mti_metaref's high bits vary slowly, so
  this is normally "atomic".  txn_renew0's loop re-reads it to verify.
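
  The adjustment might look like this, assuming R keeps the plain
  txnid*16 + P*4 + M layout so that it holds the low 28 txnid bits.
  The function name is made up:

  ```c
  #include <stdint.h>

  #define LOW_BITS 28u                       /* txnid bits held in R */
  #define LOW_MASK ((UINT32_C(1) << LOW_BITS) - 1)

  /* Rebuild the 32-bit txnid from a 64-bit metaref whose two halves
   * may have been read at slightly different times. */
  static uint32_t metaref_fix_txnid(uint64_t ref64)
  {
      uint32_t hi   = (uint32_t)(ref64 >> 32); /* full txnid, maybe stale */
      uint32_t r    = (uint32_t)ref64;         /* R, truncated to 32 bits */
      uint32_t low  = r >> 4;                  /* authoritative low bits */
      uint32_t diff = (low - hi) & LOW_MASK;   /* distance modulo 2^28 */

      if (diff > LOW_MASK / 2)
          return hi - (LOW_MASK + 1 - diff);   /* hi was slightly ahead */
      return hi + diff;                        /* hi was equal or behind */
  }
  ```

  This picks the txnid nearest to the high part whose low 28 bits
  agree with R.  It is only right while the two halves differ by
  less than 2^27 txns, which matches the "small value" above.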

* We can squeeze away some bytes and mti_metaref bits, but don't get
  sucked into spending time on that before the rest is working.

  E.g. the array[4] counting pages/volatiles can be just [2] if
  Commit always toggles the (mti_metaref & 1) bit.  And we can likely
  include some of the mti_metaref fields in the txnid.

Roads to Hell:

* Support checkpointing without waiting for the write-mutex.

  Programs which have used volatiles may want this so they can exit
  quickly.  But it requires a new shared mutex for write_meta(), or
  some clever tricks I've thought of which would not quite work.

* Rescue volatiles in a dead lockfile when the user "knows" it's safe.

  Easy enough, just don't clear them when resetting the lockfile.
  But users screw up, and may then blame LMDB.  Users who want to
  screw up can use MDB_NOSYNC instead of volatile commits.

  Or if volatile metas are in non-file shared memory which a system
  crash will kill, it's _almost_ safe to not reset them along with the
  lockfile.  Unless someone unmounts/mounts the disk, or replaces the
  DB by overwriting it with another DB file, or who knows what else.

-- 
Hallvard
