>>> We are using the combination:
>>> envflags writemap nosync lifo
>>> checkpoint 0 1
>>>
>>> If the checkpoint is set in seconds, it gives us the assurance of a
>>> consistent database state on disk.
>>> However, without this patch the meta-pages can be written by the kernel
>>> before the data.
>>>
>>> In fact, for a full guarantee in case of death of the slapd process,
>>> the meta-pages should be written explicitly.
>
> No, the DB can never go inconsistent due to a process crashing - the pages
> in OS cache are always correct. It can only go inconsistent if the OS
> crashes and a proper sync has not occurred.

Yes, Howard, you are right. But apparently I need to be more precise.
By "death" of slapd I meant all possible causes, including power-off.

For example, a power-off case:
 - The main power is turned off and the system switches to the UPS.
 - Given the notice, the OS starts an emergency shutdown of processes.
 - For some reason (e.g. not enough time to stop) slapd receives SIGKILL.
 - The OS starts writing the mmap region of the DB file, beginning with
   the lowest address.
 - Suppose the meta-pages are written completely, but the battery runs
   out before the rest of the data is written.
 - Now the DB on disk is completely destroyed.

To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicitly after the data pages.

>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>> Author: Leo Yuriev <[email protected]>
>>>>> Date: 2014-09-19 22:47:19 +0400
>>>>>
>>>>> BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>
>>>>> Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>> in this case database would be inconsistent.
>>>>>
>>>>> Check-and-retry if lead txn-id changed during flushing data in
>>>>> mdb_sync_env().
>
> Fundamentally, you are trying to make an inherently unsafe configuration
> "safer", but it's impossible. Assume you have mlock'd the meta pages into
> memory, so the OS never flushes them itself any more, and you're running
> with NOSYNC. That means, within 3 transactions, the data pages on disk will
> be out of sync with the meta pages on disk. If the OS crashes at that point,
> the entire DB will be lost.

Not a problem. As I explained above, we should write the meta-pages
explicitly after the data sync. But we also should not perform
reclaiming ahead of the last checkpoint.

> The only way to make this mode of operation somewhat safe is to defer
> reclaiming pages for even longer.
> E.g., instead of halting at current_txnid - 3, halt at
> current_txnid - 22, in which case the data pointed to by the
> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>
> Note that in combination with your LIFO patch, it's pretty much guaranteed
> that the on-disk meta pages will be useless after only 2 un-sync'd
> transactions.

Yes, Howard, you are right. But I think there is confusion in the
discussion because the LIFO feature is being mixed up with the changes
for checkpoint consistency in the NOSYNC and WRITEMAP+NOSYNC modes.

For the "NOSYNC + checkpoints" topic I will submit a separate ITS
(like the 'volatile'-related 7969, 7970, 7971).
In my opinion it is a flaw, and there is no reason not to fix it.

Continuing the conversation about checkpoints in the LIFO context:
I saw the problem you pointed out and have been thinking over a
solution, but have not yet found the "golden mean".
And since we have a serious problem with syncrepl, I put this task off
with the excuse that "the LIFO patch does not make things worse than
they were."

In general, we should not reclaim anything ahead of the txn that is
synced to disk (let us call this the R-rule).
To do so we need a second field like mti_txnid, but one which is updated
only at the end of mdb_env_write_meta().
Finally, mdb_find_oldest() should start its search from the value of
this new field instead of the current txn number.
This seems like it will work fine.

However, I paused to reason about the purpose of checkpoints, the design
of LMDB as a product, the expectations of the user and the necessary
configuration parameters:
 - checkpoints are needed ONLY in nosync modes;
 - if the user does NOT activate checkpoints, he does not care about
   consistency;
 - but if they are turned on, we MUST provide consistency at the
   checkpoints;
 - otherwise the checkpoint feature is pointless and should be REMOVED.

Therefore the implementation of checkpoints & reclaiming should be
updated to conform to the "R-rule" noted above.

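To make the R-rule concrete, here is a minimal sketch written as a toy
model, not as LMDB code. All names in it (toy_env, last_synced,
toy_checkpoint_done, toy_find_oldest, toy_can_reclaim) are hypothetical;
last_synced only plays the role of the proposed second field next to
mti_txnid. The strict comparison in toy_can_reclaim is an assumption of
this model, chosen to be conservative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; these are NOT LMDB's real structures. */
typedef unsigned long txnid_t;

#define TOY_MAX_READERS 8

typedef struct {
    txnid_t last_committed;           /* newest txn committed in memory        */
    txnid_t last_synced;              /* newest txn whose data pages AND meta
                                         page have reached the disk            */
    txnid_t readers[TOY_MAX_READERS]; /* snapshot txn id per slot, 0 = free    */
} toy_env;

/* To be called only after the data pages, and then the meta page,
 * of flushed_txn have been written out (the end of a checkpoint). */
static void toy_checkpoint_done(toy_env *env, txnid_t flushed_txn)
{
    assert(flushed_txn <= env->last_committed);
    if (flushed_txn > env->last_synced)
        env->last_synced = flushed_txn;
}

/* Oldest txn whose pages must be preserved. The search starts from the
 * last synced txn (the R-rule) instead of the current txn number, and
 * is then lowered by any older active reader. */
static txnid_t toy_find_oldest(const toy_env *env)
{
    txnid_t oldest = env->last_synced;
    for (size_t i = 0; i < TOY_MAX_READERS; i++) {
        txnid_t r = env->readers[i];
        if (r != 0 && r < oldest)
            oldest = r;
    }
    return oldest;
}

/* Pages freed by txn freed_by may be reused only if that txn is strictly
 * older than everything that still has to be preserved. */
static int toy_can_reclaim(const toy_env *env, txnid_t freed_by)
{
    return freed_by < toy_find_oldest(env);
}
```

In this model, with last_committed = 10 and last_synced = 7, pages freed
by txn 6 may be reused while pages freed by txns 7..10 stay untouched
until the next checkpoint completes, so the on-disk meta-page never
points at overwritten data.
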
From this point of view the LIFO feature should also be refined, but it
can nevertheless be very useful:
 - SYNC mode = benefits from storage with a write-back cache (assumed
   to be battery-backed).
 - ASYNC/NOSYNC without checkpoints = significant reduction of
   committed/dirty pages and thereby far fewer write IOPS.
 - ASYNC/NOSYNC with checkpoints = seems to be the same as the SYNC case.

To sum up all of the above: I think we first need to fix reclaiming or
remove the checkpoints, and then I will complete LIFO.

> -- Howard Chu
> CTO, Symas Corp. http://www.symas.com
> Director, Highland Sun http://highlandsun.com/hyc/
> Chief Architect, OpenLDAP http://www.openldap.org/project/

Thanks for the conversation.
Leonid.
