For information only: the 'lifo' and 'coalesce' features are now
implemented in the ReOpenLDAP fork.

1) lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
https://github.com/ReOpen/ReOpenLDAP/commit/829c2063b602238b5c93ea36a981de3d0d7994bc

2) lmdb-backend: support config for 'lifo' and 'coalesce' envflags.
https://github.com/ReOpen/ReOpenLDAP/commit/08b4a41b5b837548444ef0fef7614940c41d882a

There are a couple of known issues:

1) lmdb in 'writemap' mode may be inconsistent even with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/1

2) the lifo feature should be synchronized with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/2

However, it currently gives a substantial boost (5-10 times) in write
performance in our use case.

Leonid.

2014-10-20 0:27 GMT+03:00 Леонид Юрьев <[email protected]>:
>>>> We are using the combination:
>>>> envflags writemap nosync lifo
>>>> checkpoint 0 1
>>>>
>>>> If the checkpoint is set in seconds, it assures us of a
>>>> consistent database state on disk.
>>>> However, without this patch meta-pages can be written by the kernel
>>>> before the data.
>>>>
>>>> In fact, for a full guarantee in case of death of the slapd process,
>>>> the meta-page should be written explicitly.
>>
>>
>> No, the DB can never go inconsistent due to a process crashing - the pages
>> in OS cache are always correct. It can only go inconsistent if the OS
>> crashes and a proper sync has not occurred.
>
> Yes, Howard, you are right.
> But apparently I need to be more precise.
> Talking about the "death" of slapd, I meant all possible causes,
> including power-off.
>
> For example, a power-off case:
> - The main power is turned off and the system switches to the UPS.
> - Given the notice, the OS starts an emergency shutdown of processes.
> - For some reason (not enough time to stop) slapd receives SIGKILL.
> - The OS tries to write the mmap-region of the DB-file and begins with
> the lower addresses.
> - Suppose the meta-pages are written completely, but there is not
> enough battery power for the rest of the data.
> - Now the DB is completely destroyed on the disk.
>
> To avoid this, the meta-pages should not be included in the rw-mapped
> region, and should be written explicitly after the data pages.
>
>
>>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>>> Author: Leo Yuriev <[email protected]>
>>>>>> Date: 2014-09-19 22:47:19 +0400
>>>>>>
>>>>>> BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>>
>>>>>> Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>>> in this case database would be inconsistent.
>>>>>>
>>>>>> Check-and-retry if lead txn-id changed during flushing data in
>>>>>> mdb_sync_env().
>>
>> Fundamentally, you are trying to make an inherently unsafe configuration
>> "safer", but it's impossible. Assume you have mlock'd the meta pages into
>> memory, so the OS never flushes them itself any more, and you're running
>> with NOSYNC. That means, within 3 transactions, the data pages on disk will
>> be out of sync with the meta pages on disk. If the OS crashes at that point,
>> the entire DB will be lost.
>
> Not a problem.
> As I explained above, we should write the meta-pages explicitly after
> the data sync.
> But we should also not perform reclaiming ahead of the last checkpoint.
>
>> The only way to make this mode of operation somewhat safe is to defer
>> reclaiming pages for even longer. E.g., instead of halting at current_txnid
>> - 3, halt at current_txnid - 22, in which case the data pointed to by the
>> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>>
>> Note that in combination with your LIFO patch, it's pretty much guaranteed
>> that the on-disk meta pages will be useless after only 2 un-sync'd
>> transactions.
>
> Yes, Howard, you are right.
> But I think there is confusion in the discussion because the
> LIFO-feature is being mixed with the changes for checkpoint consistency
> in the NOSYNC and WRITEMAP+NOSYNC modes.
> For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like
> the 'volatile'-related 7969, 7970, 7971).
> In my opinion it is a flaw, and there is no reason not to fix it.
>
> Continuing the conversation about checkpoints in a LIFO context.
> I see the problem that you described and am thinking over its solution,
> but have not yet found the "golden ratio".
> And since we are having a serious problem with syncrepl, I put
> off this task with the excuse "the LIFO-patch does not make it worse
> than it was."
>
> In general, we should not reclaim anything ahead of the txn that
> is synced to the disk (let this be named the R-rule).
> To do so we need a second field like mti_txnid, but one which will be
> updated only at the end of mdb_env_write_meta().
> Finally, we should start the search in mdb_find_oldest() from the value
> of this new field instead of the current txn number.
> This seems like it will work fine.
>
> However, I paused to reason about the purpose of the checkpoints,
> about the design of LMDB as a product, about the expectations of
> the user and the necessary configuration parameters:
> - checkpoints are needed ONLY in nosync modes;
> - if the user does NOT activate the checkpoints, he does not care about
> consistency;
> - but if they are turned on, we MUST provide consistency at the
> checkpoints;
> - otherwise the checkpoints feature is pointless and should be REMOVED.
> Therefore the implementation of checkpoints & reclaiming should be
> updated to conform to the "R-rule" noted above.
>
> From this point of view the LIFO-feature should also be refined, but
> it can nevertheless be very useful:
> - SYNC mode = benefits from storage with a write-back cache
> (assumed to be battery-backed).
> - ASYNC/NOSYNC without checkpoint = significant reduction of
> committed/dirty pages and thereby much fewer write-iops.
> - ASYNC/NOSYNC with checkpoint = seems to be the same as the SYNC case.
>
> To sum up all of the above - I think we first need to fix reclaiming or
> delete the checkpoints, and then I will complete LIFO.
>
>> -- Howard Chu
>> CTO, Symas Corp. http://www.symas.com
>> Director, Highland Sun http://highlandsun.com/hyc/
>> Chief Architect, OpenLDAP http://www.openldap.org/project/
>
> Thanks for the conversation.
> Leonid.
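
To make the "R-rule" from the quoted message concrete, here is a minimal
C sketch. It is a toy model under stated assumptions, not LMDB code:
struct toy_env, the last_synced field and the toy_* functions are
hypothetical names invented for illustration. last_committed plays the
role of mti_txnid, last_synced is the proposed second field, and
toy_find_oldest() stands in for the search done in mdb_find_oldest().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

/* Hypothetical toy environment; real LMDB structures look different. */
struct toy_env {
    txnid_t last_committed; /* like mti_txnid: advanced on every commit */
    txnid_t last_synced;    /* proposed field: advanced only after the
                               meta page has been durably written */
};

/* Commit in a nosync mode: nothing is guaranteed to be on disk yet. */
static void toy_commit(struct toy_env *env)
{
    env->last_committed++;
}

/* Checkpoint: real code would flush the data pages first and then write
 * the meta page; only after both steps is last_synced advanced. */
static void toy_checkpoint(struct toy_env *env)
{
    assert(env->last_synced <= env->last_committed);
    env->last_synced = env->last_committed;
}

/* R-rule: the reclaiming bound starts from the last synced txn instead
 * of the current txn, and is lowered further by any active reader.
 * 'readers' holds the txn ids of active readers only. */
static txnid_t toy_find_oldest(const struct toy_env *env,
                               const txnid_t *readers, size_t nreaders)
{
    txnid_t oldest = env->last_synced;
    for (size_t i = 0; i < nreaders; i++) {
        if (readers[i] < oldest)
            oldest = readers[i];
    }
    return oldest;
}
```

In this model, five unsynced commits leave toy_find_oldest() at 0, so
nothing newer than the on-disk state becomes reclaimable. After a
checkpoint and two further commits it returns 5, and an active reader at
txn 3 lowers the bound to 3.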
