Hi,
I'm struggling with an issue where a large but mostly empty LMDB database 
reports MDB_MAP_FULL when trying to commit a large write transaction.

The LMDB mapsize is configured to 40 GiB and indeed, the database (data.mdb) 
already has this size. Thus, it cannot grow.

mdb_stat says:

Status of Main DB
  Tree depth: 3
  Branch pages: 24
  Leaf pages: 1123
  Overflow pages: 152153
  Entries: 27423

Computing [ (branch + leaf + overflow) * page_size ] (I assume page_size is 4 
KiB) gives only ~600 MiB of actual database usage, which matches the estimated 
amount of data stored there. So only about 1.5 % of the map is used, and there 
ought to be plenty of free space inside.

The failing write attempt tries to insert 700 MiB more data in one 
transaction, still just a small percentage of the mapsize. Nevertheless, it 
fails with MDB_MAP_FULL.

Note that our application stores such large data by splitting it into small 
chunks (~70 KiB each) and storing them as many key-value records. This is done 
to avoid issues with searching for too-large contiguous free space.

I have tracked down one such failing write attempt with gdb to see what exactly 
fails:

#0  mdb_page_alloc (mc=0x7fffd7e9fa08, num=8, mp=0x7fffd7e9f518) at 
contrib/lmdb/mdb.c:2286
#1  0x00000000004e2cf6 in mdb_page_new (mc=0x7fffd7e9fa08, flags=4, num=8, 
mp=0x7fffd7e9f578) at contrib/lmdb/mdb.c:7178
#2  0x00000000004e6108 in mdb_node_add (mc=0x7fffd7e9fa08, indx=98, 
key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, pgno=0, flags=65536) at 
contrib/lmdb/mdb.c:7320
#3  0x00000000004de628 in mdb_cursor_put (mc=0x7fffd7e9fa08, 
key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:6947
#4  0x00000000004e830a in mdb_put (txn=0x7fffd000cd80, dbi=1, 
key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:9022

It shows an unsuccessful attempt to allocate 8 pages (interestingly, quite a 
small request, since it had allocated many 20-page chunks just before).

Tracking down the LMDB source code I found this interesting line: 
https://git.openldap.org/openldap/openldap/-/blob/master/libraries/liblmdb/mdb.c#L2163

Unfortunately, I don't have enough insight to understand mdb_page_alloc() in 
its full complexity, but this looks like a kind of heuristic: it limits the 
number of scanned fragments of free space (for reasons unknown to me). In the 
past, it has also been "tuned": 
https://git.openldap.org/openldap/openldap/-/commit/5ee99f1125a775f28ed69b06d991a43c60d894a9

I tried patching my copy of the LMDB source by doubling this magic constant 
(60 -> 120), and voilà, this particular write transaction succeeded afterwards. 
However, this is obviously not a sustainable way of sorting this out.

What is interesting to me is that the number of retries depends only on the 
size of the requested allocation (number of pages), not on the mapsize. It 
seems plausible that in a huge database there may be many tiny (e.g. one-page) 
fragments of free space that have to be skipped before any large-enough one is 
found.

However, I don't yet dare to create an issue on this.

Could you please tell me whether my reasoning is on the right track or not? 
Does LMDB need some improvement to avoid this (and similar) issues?

Do you think it would be wiser for our application to chunk the large data 
into even smaller parts (70 KiB -> 3 KiB) to reduce fragmentation of the 
overflow-page space in LMDB?

Thank you for any hints and further insight into LMDB internals.

Many cheers,
Libor

Knot DNS | CZ NIC
