Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes

leo Sun, 19 Oct 2014 14:29:23 -0700

>>> We are using the combination:
>>>    envflags writemap nosync lifo
>>>    checkpoint 0 1
>>>
>>> If the checkpoint is set in seconds, it gives us the assurance of a
>>> consistent database state on disk.
>>> However, without this patch meta-pages can be written by the kernel
>>> before the data.
>>>
>>> In fact, for a full guarantee in case of death of the slapd process,
>>> the meta-page should be written explicitly.
>
>
> No, the DB can never go inconsistent due to a process crashing - the pages
> in OS cache are always correct. It can only go inconsistent if the OS
> crashes and a proper sync has not occurred.


Yes, Howard, you are right.
But apparently I need to be more precise.
By "death" of slapd I meant all possible causes, including power-off.

For example, a power-off case:
- The main power is turned off and the system switches to the UPS.
- On receiving the notice, the OS starts an emergency shutdown of processes.
- For some reason (e.g. not enough time to stop cleanly) slapd receives
SIGKILL.
- The OS starts writing out the mmap region of the DB file, beginning
with the lowest address.
- Suppose the meta-pages are written completely, but the battery runs
out before the rest of the data is written.
- Now the DB on disk is completely destroyed.

To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicitly after the data pages.
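
To make the proposed ordering concrete, here is a minimal sketch in
plain POSIX C. It is not LMDB code: demo_meta, demo_checkpoint and
demo_run are hypothetical names, the meta page is reduced to a single
txnid field, and the file layout (two meta pages followed by data) is
assumed only for illustration. The rw mapping starts after the meta
pages, so the kernel cannot flush a meta page on its own; the meta page
goes out through pwrite() only after msync() of the data has returned.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_META 2  /* two meta pages at the start of the file */

/* Hypothetical, simplified meta page: only the field this sketch needs. */
typedef struct {
    uint64_t txnid;
} demo_meta;

/*
 * Checkpoint with explicit ordering:
 *   1. flush the data pages of the rw mapping,
 *   2. only then write the meta page through the file descriptor,
 *   3. flush the meta page.
 * data_map is assumed to start after the meta pages.
 */
static int demo_checkpoint(int fd, void *data_map, size_t data_len,
                           const demo_meta *meta, long page_size)
{
    assert(data_map != NULL && meta != NULL);

    if (msync(data_map, data_len, MS_SYNC) != 0)
        return -1;

    /* Alternate between the two meta slots by txnid parity. */
    off_t off = (off_t)(meta->txnid % NUM_META) * page_size;
    if (pwrite(fd, meta, sizeof *meta, off) != (ssize_t)sizeof *meta)
        return -1;

    return fsync(fd);
}

/* Small driver: returns the txnid read back from the file, 0 on error. */
static uint64_t demo_run(const char *path)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t data_len = (size_t)ps * 4;
    off_t data_off = (off_t)ps * NUM_META;
    demo_meta meta = { 7 };
    demo_meta on_disk = { 0 };
    uint64_t result = 0;
    char *data;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 0;

    if (ftruncate(fd, data_off + (off_t)data_len) == 0) {
        /* Map only the data pages; meta pages stay outside the mapping. */
        data = mmap(NULL, data_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, data_off);
        if (data != MAP_FAILED) {
            memset(data, 0xAB, data_len);  /* stand-in for dirty pages */
            if (demo_checkpoint(fd, data, data_len, &meta, ps) == 0 &&
                pread(fd, &on_disk, sizeof on_disk,
                      (off_t)(meta.txnid % NUM_META) * ps)
                    == (ssize_t)sizeof on_disk)
                result = on_disk.txnid;
            munmap(data, data_len);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

Calling demo_run() with a scratch path exercises the whole sequence once.
The sketch only shows the write ordering; it does not address which
pages may be reclaimed between checkpoints.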


>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>> Author: Leo Yuriev <[email protected]>
>>>>> Date:   2014-09-19 22:47:19 +0400
>>>>>
>>>>>        BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>
>>>>>        Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>>        in this case database would be inconsistent.
>>>>>
>>>>>        Check-and-retry if lead txn-id changed during flushing data in
>>>>> mdb_sync_env().
>
> Fundamentally, you are trying to make an inherently unsafe configuration
> "safer", but it's impossible. Assume you have mlock'd the meta pages into
> memory, so the OS never flushes them itself any more, and you're running
> with NOSYNC. That means, within 3 transactions, the data pages on disk will
> be out of sync with the meta pages on disk. If the OS crashes at that point,
> the entire DB will be lost.

Not a problem.
As I explained above, we should write the meta-pages explicitly after
the data sync.
But we should also not reclaim anything ahead of the last checkpoint.

> The only way to make this mode of operation somewhat safe is to defer
> reclaiming pages for even longer. E.g., instead of halting at current_txnid
> - 3, halt at current_txnid - 22, in which case the data pointed to by the
> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>
> Note that in combination with your LIFO patch, it's pretty much guaranteed
> that the on-disk meta pages will be useless after only 2 un-sync'd
> transactions.

Yes, Howard, you are right.
But I think there is confusion in this discussion because the LIFO
feature is being mixed with the changes for checkpoint consistency in
the NOSYNC and WRITEMAP+NOSYNC modes.
For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like
the 'volatile'-related 7969, 7970, 7971).
In my opinion it is a flaw, and there is no reason not to fix it.

Continuing the conversation about checkpoints in the LIFO context.
I saw the problem you described and have been thinking over a solution,
but have not yet found the "golden ratio".
And since we are having a serious problem with syncrepl, I put this
task off with the excuse "the LIFO patch does not make things worse
than they were."

In general, we should not reclaim anything ahead of the txn that is
synced to disk (let us call this the R-rule).
To do so we need a second field like mti_txnid, but one that is
updated only at the end of mdb_env_write_meta().
Finally, mdb_find_oldest() should start its search from the value of
this new field instead of the current txn number.
This seems like it will work fine.
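
A rough sketch of the R-rule, to show the idea rather than a patch. All
names here (demo_txninfo, synced_txnid, demo_meta_written,
demo_find_oldest, DEMO_IDLE) are hypothetical stand-ins and not the
actual LMDB structures or functions; the reader table is reduced to a
plain array of txn ids, with DEMO_IDLE assumed to mark a slot that has
no active read txn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

#define DEMO_IDLE ((txnid_t)-1)  /* reader slot with no active txn */

/* Hypothetical stand-in for the shared environment info. */
typedef struct {
    txnid_t last_txnid;    /* like mti_txnid: last committed txn */
    txnid_t synced_txnid;  /* new field: last txn whose meta page is on disk */
} demo_txninfo;

/* To be called at the very end of the meta-page write, after the flush. */
static void demo_meta_written(demo_txninfo *ti, txnid_t txnid)
{
    if (txnid > ti->synced_txnid)
        ti->synced_txnid = txnid;
}

/*
 * Oldest txn id that reclaiming must respect. The search starts from
 * the last synced txn instead of the current txn, so nothing newer than
 * the on-disk state becomes reclaimable; active readers can only lower
 * the result further.
 */
static txnid_t demo_find_oldest(const demo_txninfo *ti,
                                const txnid_t *readers, size_t nreaders)
{
    txnid_t oldest = ti->synced_txnid;
    size_t i;

    for (i = 0; i < nreaders; i++) {
        /* DEMO_IDLE is the maximum value, so idle slots never win. */
        if (readers[i] < oldest)
            oldest = readers[i];
    }
    return oldest;
}
```

With last_txnid at 10 and the last synced txn at 7, demo_find_oldest()
returns 7 even when no reader is active, so pages still referenced by
the on-disk meta page are not handed out again before the next
checkpoint.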

However, I stopped to reason about the purpose of checkpoints, about
the design of LMDB as a product, about the expectations of the user
and the necessary configuration parameters:
- checkpoints are needed ONLY in nosync modes;
- if the user does NOT activate checkpoints, he does not care about
consistency;
- but if they are turned on, we MUST provide consistency at the
checkpoints;
- otherwise the checkpoint feature is pointless and should be REMOVED.
Therefore the implementation of checkpoints & reclaiming should be
updated to conform to the "R-rule" noted above.

From this point of view the LIFO feature should also be refined, but
it can nevertheless be very useful.
- SYNC mode = benefits from storage with a write-back cache
(assumed to be battery-backed).
- ASYNC/NOSYNC without checkpoints = significant reduction of
committed/dirty pages and thereby far fewer write IOPS.
- ASYNC/NOSYNC with checkpoints = seems to be the same as the SYNC case.

To sum up: I think we first need to fix reclaiming or remove the
checkpoints, and then I will complete LIFO.

>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Thanks for the conversation.
Leonid.
