On 02/11/2017 02:33 AM, Andres Freund wrote:
On 2017-02-11 02:13:59 +0100, Tomas Vondra wrote:
On 02/09/2017 10:37 PM, Andres Freund wrote:
On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:
src/backend/utils/mmgr/Makefile | 2 +-
src/backend/utils/mmgr/aset.c | 115 +--------------------------------
src/backend/utils/mmgr/memdebug.c | 131 ++++++++++++++++++++++++++++++++++++++
src/include/utils/memdebug.h | 22 +++++++
4 files changed, 156 insertions(+), 114 deletions(-)
create mode 100644 src/backend/utils/mmgr/memdebug.c
I'm a bit loathe to move these to a .c file - won't this likely make
these debugging tools even slower? Seems better to put some of them
them in a header as static inlines (not randomize, but the rest).
Do you have any numbers to support that? AFAICS compilers got really good in
inlining stuff on their own.
Unless you use LTO, they can't inline across translation units. And
using LTO is slow enough for linking that it's not that much fun to use,
as it makes compile-edit-compile cycles essentially take as long as a
From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Wed, 30 Nov 2016 15:36:23 +0100
Subject: [PATCH 2/3] slab allocator
src/backend/replication/logical/reorderbuffer.c | 82 +--
src/backend/utils/mmgr/Makefile | 2 +-
src/backend/utils/mmgr/slab.c | 803 ++++++++++++++++++++++++
src/include/nodes/memnodes.h | 2 +-
src/include/nodes/nodes.h | 1 +
src/include/replication/reorderbuffer.h | 15 +-
src/include/utils/memutils.h | 9 +
I'd like to see the reorderbuffer changes split into a separate commit
from the slab allocator introduction.
I rather dislike patches that only add a bunch of code, without actually
using it anywhere.
But if needed, this is trivial to do at commit time - just don't
commit the reorderbuffer bits.
+ * Each block includes a simple bitmap tracking which chunks are used/free.
+ * This makes it trivial to check if all chunks on the block are free, and
+ * eventually free the whole block (which is almost impossible with a
+ * freelist of chunks, storing chunks from all blocks).
Why is checking a potentially somewhat long-ish bitmap better than a
simple counter, or a "linked list" of "next free chunk-number" or such
(where free chunks simply contain the id of the subsequent chunk)?
Using a list instead of a bitmap would also make it possible to get
'lifo' behaviour, which is good for cache efficiency. A simple
chunk-number based singly linked list would only imply a minimum
allocation size of 4 - that seems perfectly reasonable?
A block-level counter would be enough to decide if all chunks on the block
are free, but it's not sufficient to identify which chunks are free /
available for reuse.
The bitmap only has a single bit per chunk, so I find "potentially long-ish"
is a bit misleading. Any linked list implementation will require much more
per-chunk overhead - as the chunks are fixed-legth, it's possible to use
chunk index (instead of 64-bit pointers), to save some space. But with large
blocks / small chunks that's still at least 2 or 4 bytes per index, and
you'll need two (to implement doubly-linked list, to make add/remove
For example assume 8kB block and 64B chunks, i.e. 128 chunks. With bitmap
that's 16B to track all free space on the block. Doubly linked list would
require 1B per chunk index, 2 indexes per chunk. That's 128*2 = 256B.
I have a hard time believing this the cache efficiency of linked lists
(which may or may not be real in this case) out-weights this, but if you
want to try, be my guest.
I'm not following - why would there be overhead in anything for
allocations bigger than 4 (or maybe 8) bytes? You can store the list
(via chunk ids, not pointers) inside the chunks itself, where data
otherwise would be. And I don't see why you'd need a doubly linked
list, as the only operations that are needed are to push to the front of
the list, and to pop from the front of the list - and both operations
are simple to do with a singly linked list?
Oh! I have not considered storing the chunk indexes (for linked lists)
in the chunk itself, which obviously eliminates the overhead concerns,
and you're right a singly-linked list should be enough.
There will be some minimum-chunk-size requirement, depending on the
block size/chunk size. I wonder whether it makes sense to try to be
smart and make it dynamic, so that we only require 1B or 2B for cases
when only that many chunks fit into a block, or just say that it's 4B
and be done with it.
I mean 2^32 chunks ought to be enough for anyone, right?
I'm still not buying the cache efficiency argument, though. One of the
reasons is that the implementation prefers blocks with fewer free chunks
when handling palloc(), so pfree() is making the block less likely to be
chosen by the next palloc().
Thirdly, isn't that approach going to result in a quite long freelists
array, when you have small items and a decent blocksize? That seems like
a fairly reasonable thing to do?
I'm confused. Why wouldn't that be reasonable. Or rather, what would be a
more reasonable way?
If I understood correctly, you have one an array of doubly linked lists.
A block is stored in the list at the index #block's-free-elements. Is that
If so, if you have e.g. 8 byte allocations and 64kb sized blocks, you
end up with an array of 1024 doubly linked lists, which'll take up 64kb
on its own. And there a certainly scenarios where even bigger block
sizes could make sense. That's both memory overhead, and runtime
overhead, because at reset-time we'll have to check the whole array
(which'll presumably largely be empty lists). Resetting is a pretty
True, but it's not entirely clear if resetting is common for the paths
where we use those new allocators.
Also, if we accept that it might be a problem, what other solution you
propose? I don't think just merging everything into a single list is a
good idea, for the reasons I explained before (it might make the resets
somewhat less expensive, but it'll make pfree() more expensive).
What might work is replacing the array with a list, though. So we'd have
a list of lists, which would eliminate the array overhead.
+ * We need to update index of the next free chunk on the block. If we
+ * the last free chunk on this block, set it to chunksPerBlock (which is
+ * not a valid chunk index). Otherwise look for the next chunk - we know
+ * that it has to be above the current firstFreeChunk value, thanks to
+ * we maintain firstFreeChunk here and in SlabFree().
+ if (block->nfree == 0)
+ block->firstFreeChunk = set->chunksPerBlock;
+ /* look for the next free chunk in the block, after the first
+ while ((++block->firstFreeChunk) < set->chunksPerBlock)
+ int byte = block->firstFreeChunk /
+ int bit = block->firstFreeChunk % 8;
+ /* stop when you find 0 (unused chunk) */
+ if (!(block->bitmapptr[byte] & (0x01 << bit)))
+ /* must have found the free chunk */
+ Assert(block->firstFreeChunk != set->chunksPerBlock);
This and previous code just re-affirms my opinion that a bitmap is not
the best structure here.
It'd be great if you could explain why, instead of just making such claims
Because it's complicated. This is a fair bit of code and branches to
run in a pretty hot path.
Hmm. I admit updating the index of the first free chunk is a bit
cumbersome, and the linked list would make it unnecessary.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: