This references the previously posted (by someone else) article on SUN's libu
memory architecture. If needed I can run through the archives and find the
URL.
I've already written a threaded implementation which has decent
responsiveness. It did as much as SUN's libu claimed it would do.
Recently, however, I redesigned the architecture from scratch, taking
advantage of many possible optimizations. Since it's towards the end of the
day, I figured I'd document my architecture here and wait for comments before
beginning the rewrite / redebug.
As a refresher, SUN's design is based on 4 layers: vmem, slab, depot, and
magazine-cache.
At the bottom is vmem, which maps "pages" of memory (in libu, new pages are
acquired from mmap; excess are readily freed via munmap).
The next higher layer is the slab, which partitions a contiguous collection
of pages into an array of raw memory objects. (Objects are fixed-size
pieces of memory, so each object size requires a separate slab.) The slab
layer requests n pages from vmem to construct a new slab when it runs out of
memory objects. It also frees slabs back to vmem when all the individual
objects have been reclaimed. Both of these occur automatically w/o external
intervention.
The next higher layer is a "depot", which stores "constructed" objects; e.g.
objects that are initialized and potentially contain references to other
memory objects. When memory requests are made and no free objects exist,
requests are passed down to the associated slab (of corresponding size) and a
constructor is applied. This happens automatically. Free's to the slab are
asynchronous; they occur when the entire system is memory starved, or when a
depot has too many free objects. In any case, prior to a free to the slab,
the destructor is applied to each freed memory object.
The highest level is the magazine-cache. The sole function of the cache is
to give multiple threads thread-local access to the depot. It
does this by caching up to 2*M objects in a thread-local magazine structure.
On an alloc, when the magazine is empty, thread-synchronized access is
performed to the underlying depot / slab / vmem layers. On a free, when the
magazine is full (contains 2*M objects), thread-synchronized access to lower
layers is performed. Most of the time there is a mixture of allocs and
frees, which avoids lower-level accesses (and thus mutex locks). Aside from
alleviating sync contention, this minimizes the sharing of memory objects
between threads, and on multi-CPU systems this avoids cache pollution (due
to cache lines being forced into the MESI shared state).
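For concreteness, the per-thread fast path of that scheme looks roughly like
this. It is only a sketch with illustrative names (MAG_SIZE, thread_cache_t,
mag_cache_alloc, depot_alloc_slow); it is not SUN's actual API:

#define MAG_SIZE 8                       /* "M" above; illustrative value            */

typedef struct {
    int   rounds;                        /* objects currently held in this magazine  */
    void* objs[ MAG_SIZE ];
} magazine_t;

typedef struct {
    magazine_t* loaded;                  /* active magazine                          */
    magazine_t* previous;                /* second magazine; together they hold 2*M  */
} thread_cache_t;

void* depot_alloc_slow( thread_cache_t* tc );   /* locks the depot / slab / vmem layers */

void* mag_cache_alloc( thread_cache_t* tc )
{
    if (tc->loaded->rounds > 0)                          /* common case: no locking */
        return tc->loaded->objs[ --tc->loaded->rounds ];

    if (tc->previous->rounds > 0) {                      /* swap magazines and retry */
        magazine_t* tmp = tc->loaded;
        tc->loaded   = tc->previous;
        tc->previous = tmp;
        return tc->loaded->objs[ --tc->loaded->rounds ];
    }
    /* both magazines empty: thread-synchronized access to the lower layers */
    return depot_alloc_slow( tc );
}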
So far, nothing revolutionary, albeit clever on SUN's part. The next
attribute, however, is the use of external boundary tags. The idea is that
the allocator should never write into the allocated memory itself (SUN's
design goals included support for read-only allocations). Further, these
external tags don't have to be contiguous with the memory they describe.
Runs of contiguous memory regions are collected together in a segment.
Segments are allocation regions (such as via mmap). Within a segment,
consolidation may occur. When fully consolidated, the segment is freed.
Hashtables are used to associate a user's memory pointer with these external
boundary tags. From this, SUN claims a constant (bounded) alloc/free time no
matter what is going on. From my analysis, the only possible wrinkle in this
constant time is the reclaiming stage (which makes n reclaims, where n is the
number of depots, a relatively low and constant number).
At run-time, "object-caches" can be defined for arbitrarily sized objects.
Constructors and destructors can optionally be configured. Such object
references use the magazine layer interface. For generic memory access, vmem
is the API, but small memory requests are redirected to one of several
predefined "quantum caches" (of sizes 1 page, 2 pages, ..., q_max pages), which
alleviates lock contention on vmem when it is used generically.
SUN addressed the memory fragmentation problem by placing a lower bound on
the size of a slab's memory requests: slabs allocate at least 3 times the
object size, and in most cases a slab is 16 - 32 times the object size.
Further, with q-caches, all allocations up to q_max have slabs of the same
size. Thus slabs used by 8- and 16-byte object allocations can be trivially
intermixed. The only danger is if you never free all the objects w/in each
slab. I call this a shotgun effect: little bits are scattered throughout the
memory space, preventing slabs from being freed, and thus preventing slabs
from being interchanged when needed.
My design is as follows:
sbrk is completely avoided, which leaves the possibility of having two
separate memory managers. The following functions are public:
void* vm_alloc(size)
void vm_free(void*)
void* vm_realloc(void*,size)
int vm_sizeof(void*)
vm_dump_stats()
// automatically allocates a slab (reuses one if possible)
depot_t* depot_create( ... )
int depot_size( depot_t*)
void depot_reclaim( depot_t* ) // called internally or by GC
// cache_create also stores the pointer in thread-local-storage
// relative to the depot
cache_t* cache_create( depot_t*, ... )
// Note objects are of fixed size..
void* cache_alloc(cache_t*)
void cache_free(cache_t*, void*)
// these two functions pull the cache_t pointers from thread-local-storage
void* depot_alloc(depot_t*)
void depot_free(depot_t*, void*)
Compilation is contingent upon a MULTITHREAD preprocessor flag. With this
flag disabled, all mutexes are obviously avoided, but the entire
magazine-cache structure is also avoided; depots are directly accessed by the
object-caches. All the other features are at least partially beneficial
even in single-threaded mode.
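A sketch of how I picture that collapse, using the depot_t / cache_t
structures defined further down (tls_key and depot_alloc_direct are made-up
names here, not part of the API above):

void* depot_alloc( depot_t* d )
{
#ifdef MULTITHREAD
    /* multithreaded build: find this thread's cache for the depot and take
     * the lock-free magazine path (tls_key is a made-up field name) */
    cache_t* c = (cache_t*) pthread_getspecific( d->tls_key );
    return cache_alloc( c );
#else
    /* single-threaded build: no magazine-cache layer, no mutexes; go
     * straight to the depot's free objects (depot_alloc_direct is made up) */
    return depot_alloc_direct( d );
#endif
}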
My initial impression is that a page should be 1024 bytes (though I'm also
looking at 512 bytes to reduce the shotgun effect of 8-byte allocs).
I've actually found that "q-caches" were detrimental to things like
consolidation and the calling depth (overhead) of a worst-case allocation.
So I'm currently leaving them out. Instead I'm using a power-of-two based
dq-cache (for "differential q-cache").
SUN's q-caches are multiples of the page size up to q_max, so there'd be
q_max q-caches (usually around 5). What I'm calling dq-caches are
subdivisions of a single page. My understanding is that SUN's vmem assumed
all sub-page allocations would be via object-caches, or via a vmem configured
with some incredibly small page size (like 4 bytes), but that requires many
bytes of overhead for each allocation, and I know that we'll need a lot of
sub-32-byte allocations; thus this was unacceptable to me. Hence my use of
very large page sizes and the use of dq-caches. This was my first real
deviation from SUN's proposal.
One thing I was conscious of was memory alignment. SUN uses a separate, more
complex function in place of the simple vmem_xxx calls for arbitrary alignments. Further,
their OS version of vmem requires a size parameter to all vmem functions.
Thus you'd have to remember an allocation size when freeing or reallocating
(vmem_sizeof would be paradoxical). Their paper didn't say what they did
with libu, but I'm assuming they simply prepended 4 bytes to an allocation
which contained the memory size. I didn't want that, since it adds too
much overhead to small allocations AND further frustrates memory-aligned
allocations. I address the alignment problem by requiring that ALL vmem
allocations are aligned on page boundaries. This trivializes alignment for
sub-page allocations, and there are reasonably sophisticated ways of doing
alignment for multiples of a page size. For those alignments that match the
memory size, dq-caches are already aligned. For bizarre alignments (still
below a page) the size is rounded up to a page (thus using slack space). For
sizes larger than a page, the size is rounded up to a whole number of pages;
2*num_pages are allocated, then an offset is calculated to make the memory
aligned. All prior and subsequent pages are then freed. Since a page is an
extremely useful size (used by ALL dq-caches), no memory is wasted (aside
from up to half a page of slack space on average, which might be recovered
via a vm_realloc anyway).
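A sketch of that over-allocate-and-trim step (vm_free_range is a hypothetical
helper that hands whole pages back to the segment free lists; the sketch also
assumes the alignment has already been folded into the rounded-up size, i.e.
align <= npages * PAGE_SIZE):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 1024

void* vm_alloc( size_t size );                        /* public API from above           */
void  vm_free_range( void* start, size_t bytes );     /* hypothetical page-run release   */

void* vm_alloc_aligned( size_t size, size_t align )   /* align: multiple of PAGE_SIZE    */
{
    size_t npages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    char*  raw    = (char*) vm_alloc( 2 * npages * PAGE_SIZE );   /* over-allocate       */
    if (!raw)
        return 0;

    char* aligned = (char*)(((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1));

    /* hand the unused leading and trailing pages back; external boundary tags
     * make this possible without ever writing into the allocation itself */
    if (aligned != raw)
        vm_free_range( raw, (size_t)(aligned - raw) );
    vm_free_range( aligned + npages * PAGE_SIZE,
                   (size_t)((raw + 2 * npages * PAGE_SIZE) - (aligned + npages * PAGE_SIZE)) );
    return aligned;
}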
As with most memory managers, there is an upper bound on memory sizes beyond
which mmap is exclusively utilized. At the moment, I'm leaning towards 128K.
I'm debating using a power-of-two round-up on such allocations to remove
some burden from the OS.
Now comes the freaky part. SUN utilizes a hash for ALL non-explicit
object memory allocations. From this they can deduce whether a free goes to
their q-cache, the regular paging system, or munmap. Obviously an
overabundance of 8-byte allocs could bring any hash-table to its knees
(including violating SUN's claim of real-time access). I'm assuming that by
requiring the memory size even on a free, vmem can use multiple hash-tables,
one for each size. But I don't require the memory size to be passed in, NOR
do I prepend allocations with the memory size. The only thing I have left is
memory alignment. I assume that explicit object allocations never use vmem,
and that implicit object caches (via dq-cache) are _always_ offset from a
page-aligned boundary (revisited below). Further, special user
memory-alignment requirements should either use explicit object-caches or a
separate vmemx_(free|alloc|realloc), as with SUN. Now the only remaining
ambiguity is between mmap'd and page-mapped allocations. For this, I assume
that mmap frees are very rare (since you can't have a million allocations of
size >= 128K), so they are relegated to a slower path.
vmem_free(void* ptr):
    ptr_1k = align1k( ptr )
    if ptr_1k != ptr:                we're in a dq-cache, pass to it (described later)
    else if ptr_1k in segment-hash:  we're paged
    else if ptr_1k in mmap-hash:     we're mmap'd
    else:                            throw exception "invalid mem reference"
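In rough C, using the structures defined below (the dq_cache_free /
*_hash_lookup / *_free helpers are placeholders for the hash walks described
later):

#include <stdint.h>
#include <stdlib.h>

void vmem_free( void* ptr )
{
    void* ptr_1k = (void*)((uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1));

    if (ptr_1k != ptr)
        dq_cache_free( ptr );               /* not page aligned: came from a dq-cache */
    else if (page_seg_hash_lookup( ptr_1k ))
        page_seg_free( ptr_1k );            /* normal paged allocation                */
    else if (mm_seg_hash_lookup( ptr_1k ))
        mm_seg_free( ptr_1k );              /* direct mmap allocation                 */
    else
        abort();                            /* "invalid mem reference"                */
}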
Thus I utilize two separate hashes: one for normal allocations (size <=
128K), and one for larger allocations that require mmap. Note that the
hash-table data-structure is _completely_ inlined with the data segments; no
mini-mallocs are required. This complicates the use of mmap on aligned
memory requests, but those are rare enough to warrant the extra processing.
The structures are as follows:
#define NUM_PAGES_PER_SEGMENT 255 // possibly 127
#define PAGE_SIZE 1024
#define SEGMENT_SIZE ALIGN_1K( NUM_PAGES_PER_SEGMENT * PAGE_SIZE + sizeof(page_seg_t) )
/*
 * mmap allocation
 */
typedef struct mm_seg_t {
    void* address;                  // hash-unique identifier; contains the address of &data[0]
    struct mm_seg_t *next, *prev;   // hash doubly-linked-list elements
    int   size;                     // for use in realloc
    void* data[0];                  // beginning of data
} mm_seg_t;

mm_seg_t        g_mm_seg_hash[ 128 ];
pthread_mutex_t g_mm_seg_hash_lock;
/*
 * segment allocation
 */
typedef struct _page_seg_hash_t {
    // address is implicit for segments
    struct _page_seg_hash_t *next, *prev;   // hash doubly-linked-list elements
} page_seg_hash_t;

typedef struct {
    unsigned char prev_tag_size, tag_size, f_free;
} tag_t;

typedef struct _page_seg_t {
    page_seg_hash_t hash_header;    // used by the hashing algorithms
    // segment size is constant and thus implied
    int   free_pages;               // allows a quick check of whether to free the segment
    tag_t boundary_tags[ NUM_PAGES_PER_SEGMENT + 2 ];   // one boundary tag per page, plus
                                    // 2 delimiting (begin/end) boundary tags
    void* data[0];                  // physical location of pages
} page_seg_t;

page_seg_hash_t g_page_seg_hash[ 128 ];
pthread_mutex_t g_page_seg_hash_lock;
/*
 * free pointer chains
 */
typedef struct _page_free_t {
    struct _page_free_t *next, *prev;
    page_seg_t* header;
} page_free_t;

typedef struct _available_page_free_t {
    struct _available_page_free_t *next, *prev;
    int chain_size;                 // index into g_free_chains
} available_page_free_t;

page_free_t           g_free_chains[ NUM_PAGES_PER_SEGMENT ];
available_page_free_t g_available_free_head;
From the above, you can see the embedding of the hash data structures.
Segments are directly allocated from mmap as needed (no pre-allocations at
process init), as in:
page_seg_t* seg = mmap(0, SEGMENT_SIZE, .... );
Tags are implicitly linked to pages. Offsets within seg->boundary_tags[]
correspond directly to PAGE_SIZE offsets into seg->data[], and vice versa.
If NUM_PAGES_PER_SEGMENT is <= 256, then a single byte can be used for
relative addresses. Thus segments max out at 256K. Additionally, one page
is left unused (reserved for the header), so that we don't acquire 256K +
sizeof(header). The first and last boundary tags are set to "allocated" so as
to avoid consolidating beyond the edges of the segment.
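The implicit linkage is just index arithmetic; a sketch, assuming tag 0 is
the leading sentinel so page i maps to tag i + 1:

static tag_t* page_to_tag( page_seg_t* seg, void* page )
{
    size_t i = (size_t)((char*)page - (char*)seg->data) / PAGE_SIZE;
    return &seg->boundary_tags[ i + 1 ];
}

static void* tag_to_page( page_seg_t* seg, tag_t* tag )
{
    size_t i = (size_t)(tag - seg->boundary_tags) - 1;
    return (char*)seg->data + i * PAGE_SIZE;
}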
Allocation of a segment requires initialization of the boundary_tags array
(either 129 or 257 entries), as well as insertion into the segment
hash-table.
The mmap hash-table and segment hash-table have separate mutexes since they
are independent.
I'm looking to make the hashtables statically sized, with 128 buckets. Thus
640 meg worth of 256-page segments (@ 1k / page) evenly distributed across
the hash buckets gives about 20 levels of linked-list lookup (in the worst
case there would be ~2560 lookups). The hash key currently is:
int key = (addr >> 18) & 127;
256K-aligned mmaps should be linearly assigned (at least initially), and thus
distribute pseudo-evenly throughout the hash buckets.
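A lookup sketch using that key, assuming each bucket is a circular list with
the bucket entry itself as the sentinel, hash_header is the first member of
page_seg_t, and (as noted above) segments end up 256K-aligned so every page
in a segment hashes to the same bucket:

static page_seg_t* page_seg_hash_lookup( void* page )
{
    int key = (int)(((uintptr_t)page >> 18) & 127);
    page_seg_hash_t* node = g_page_seg_hash[ key ].next;

    for ( ; node != &g_page_seg_hash[ key ]; node = node->next) {
        page_seg_t* seg  = (page_seg_t*) node;   /* hash_header is the first member */
        char*       base = (char*) seg->data;
        if ((char*)page >= base &&
            (char*)page <  base + NUM_PAGES_PER_SEGMENT * PAGE_SIZE)
            return seg;
    }
    return 0;                                    /* not a paged allocation */
}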
It's assumed that segment_size / 2 is the maximum practical allocation using
pages; above 128K, mmap is used directly for alloc/free.
With P pages in a segment, there are only P possible memory sizes. Thus for
allocation, P free-list chains are used. This avoids much of the trouble
associated with power-of-two free-chain-lists. The only problem is in
searching the free-chain lists. On initialization, there is a single 255
page free element. Allocating 1 page would require navigating 254 empty
pointer chains. But I use a doubly-linked list of non-empty
free-pointer-chains. Thus on failing to find any free page-chunks of size 1,
the next non-empty free-pointer-list is immediately found to be of size 255.
Granted, this requires starting at the head of the available-free list, so if
you're looking for page_size >= 250 and there are entries in most free
chains, you get poorer results (but it is at least bounded).
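A search sketch. It treats chain i as holding runs of exactly i + 1 pages and
chain_size as the run length; both lists are assumed to be circular with
sentinel heads, and splitting of the chosen run is omitted:

static page_free_t* find_free_run( int npages )
{
    /* exact fit first: chain npages - 1 holds runs of exactly npages pages */
    if (g_free_chains[ npages - 1 ].next != &g_free_chains[ npages - 1 ])
        return g_free_chains[ npages - 1 ].next;

    /* otherwise walk only the chains known to be non-empty */
    available_page_free_t* a = g_available_free_head.next;
    for ( ; a != &g_available_free_head; a = a->next)
        if (a->chain_size >= npages)
            return g_free_chains[ a->chain_size - 1 ].next;

    return 0;      /* nothing large enough: caller mmaps a new segment */
}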
The external boundary tags allow 1k-aligned memory requests with almost no
overhead. They also allow semi-efficient (constant-time) consolidation,
though since the segment is of a constant size I'm going to look into using a
buddy system as described in "Understanding the Linux Kernel". Though
faster, I believe the buddy system will be more wasteful of memory and less
inclined to "grow" memory on reallocs. I should be able to easily
interchange the two since they're such a small part of the vmem design.
vmem_init will have to initialize both hash-tables, and the free-mem list.
It will also have to initialize the dq-caches (below). Spawning new threads
might also require creating thread-local magazines for each dq-cache
(depending on whether a thread_id -> cpu_id mapping is possible, in which
case vmem_init would initialize one per CPU, though that requires locks in
the magazine-caches).
Before calling mmap to allocate a new segment, reclaim is applied to the list
of depots. This causes a speed hiccup, as well as leaving some important
caches starved, but I can't think of any other time this would be as
valuable. Perhaps if a separate thread's GC were used, but that's not
portable.
Though this is a mishmash, it should cover all the technology of the basic
vmem layer.
Next is the slab layer. Each slab has a centralized data structure which
contains various configuration info (alignment, object size, pointer chains
for both slabs and slabs-with-free-objects, etc.). I'm happy with my
previous work on the slab layer, so I'm not going to post it here, other than
to mention that each slab has a header. This header contains at least:
typedef struct _slab_t {
    int object_size;        // size of all allocations within this slab; used by vmem_free,
                            // vmem_sizeof, and vmem_realloc to identify the size of a void*.
                            // Note that this is redundant with slab_descriptor_t.
    struct _slab_t *next, *prev, *next_free, *prev_free;
    int   num_free;         // free objects in this slab (used to see when the slab is empty)
    void* first_free;       // used for slab allocation
    void* data[0];          // beginning of data
} slab_t;
typedef struct _slab_descriptor_t {
    pthread_mutex_t lock;
    int object_size;
    int num_pages;
    slab_t *next, *prev, *next_free, *prev_free;
    struct _slab_descriptor_t *next_desc, *prev_desc;
    ...                     // various precalculated info for more efficient operation
} slab_descriptor_t;

slab_descriptor_t g_slab_descriptor_list_head;
For slab_size <= page_size, the presence of the prefixing header means that
no object address can sit on a page boundary. This fact is exploited by
vmem_free(void*), vmem_sizeof(void*), and vmem_realloc(void*, new_size): they
can skip navigation of g_page_seg_hash entirely, go to the page boundary, and
read the object size from the slab header. Since the existence of an
external pointer means that the slab is not ready to be freed, no locking
need occur for such a lookup. However, in order to allocate from or free to
the slab, the slab_descriptor must be locked. As soon as all of a slab's
objects have been freed (num_free reaches the slab's total object count), the
slab is handed back to vmem_free.
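For example, vmem_sizeof on a dq-cache pointer reduces to roughly the
following (vmem_sizeof_paged is a hypothetical fallback into the segment /
mmap hashes):

int vmem_sizeof( void* ptr )
{
    if (((uintptr_t)ptr & (PAGE_SIZE - 1)) != 0) {
        /* dq-cache object: it can never sit on a page boundary, so rounding
         * down lands on the slab_t header at the start of its one-page slab */
        slab_t* slab = (slab_t*)((uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1));
        return slab->object_size;          /* no locks taken */
    }
    return vmem_sizeof_paged( ptr );       /* page aligned: consult the hashes */
}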
Each newly created slab is inserted into the slab-descriptor-list which
allows reuse by subsequent internal calls to slab_create(..).
The next layer is the depot. Though SUN outlined a constructor / destructor
policy, the most common case (used by the dq_caches at least) is volatile
objects with no constructor/destructor. Since we're not storing valid
objects, we are free to overwrite their contents with free-pointer chains (as
do the slab and vmem subsystems). Note that this is only a special case,
which warrants a separate set of handlers; my previous implementation adhered
to SUN's design, and I've renamed those handlers obj_magazine_XXX and
obj_depot_XXX. They have no use w/in the dq_caches, and their requirements
impede efficiency. Thus my simplified depot design requires a lower-bound
object size of sizeof(void*) * 2, since multiple free pointers are stored in
the object. SUN's design utilized a separately allocated structure called a
magazine, which was a mini-stack. My design uses the objects themselves as
linked-list stacks. The magazines are chained together (in the depot) by a
second linked list stored in the top-most element of each stack, as with:
typedef struct _magazine_el_t {
    struct _magazine_el_t* next_el;
    struct _magazine_el_t* next_stack;
} magazine_el_t;

typedef union _object_t {
    magazine_el_t el;
    void* data[0];
} object_t;

typedef struct _depot_t {
    object_t* full_magazines;
    int max_stack_size;       // increased after contentions for depot locks
    int max_num_magazines;    // max number of magazines to hold before freeing to slab
    int num_free_objects;
    pthread_mutex_t lock;
    struct _depot_t *next_depot, *prev_depot;   // for g_depot_list
    ...                       // other items from my previous implementation
} depot_t;

depot_t g_depot_list_head;
From this there is no need to maintain two lists (one of filled magazines and
one of empty magazines), since there is no concept of an empty magazine.
This helps performance since you don't have to allocate magazine structures
(which in my previous implementation hurt "free(..)" performance as well as
memory usage). All access to the depot is thread-synchronized. But
since each memory-request-size maps to a different depot, there is a smaller
likelihood of contention. Contention is further addressed by modifying
"max_stack_size" after a failed "try_lock" on the depot. It is reduced when
"reclaim" calls are made. I haven't decided how to deal with
"max_num_magazines", since theoretically this is handled by reclaim.
Lastly the magazine-cache layer uses:
typedef struct _cache_t {
    int num_data_objects;
    int num_prev_objects;
    object_t *data, *prev;
    depot_t*  depot;
} cache_t;
This is pretty close to the SUN architecture except that we're not using
magazines, but instead use the data objects themselves as linked-list stacks.
When depot->max_stack_size cache_free calls occur, data and prev are
swapped. If "data" is still too full, it is placed into the depot (after
locking). Instead of allocating an empty magazine, data is reset to NULL and
num_data_objects is reset to zero.
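In code, the free path is roughly as follows (depot_put_stack is the
hypothetical helper sketched above):

void cache_free( cache_t* c, void* obj )
{
    object_t* o = (object_t*) obj;

    if (c->num_data_objects >= c->depot->max_stack_size) {
        /* "data" is full: swap data and prev ... */
        object_t* tmp_p = c->data;    int tmp_n = c->num_data_objects;
        c->data = c->prev;            c->num_data_objects = c->num_prev_objects;
        c->prev = tmp_p;              c->num_prev_objects = tmp_n;

        /* ... and if the new "data" is still full, hand it to the depot */
        if (c->num_data_objects >= c->depot->max_stack_size) {
            depot_put_stack( c->depot, c->data, c->num_data_objects );
            c->data             = 0;
            c->num_data_objects = 0;
        }
    }
    /* common case: push onto the thread-local stack, no locks taken */
    o->el.next_el = c->data ? &c->data->el : 0;
    c->data = o;
    c->num_data_objects++;
}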
I haven't completely thought out the shared readonly use of
depot->max_stack_size. I'm inclined to fix it to 5 for the time being (since
I don't know how bad it is to read from this value as another thread is
potentially over-writing it). I might just pull a volatile copy of it into
cache_t, and have it updated on every access to depot. Additionally, if the
stack-size is dynamic, then I'll have to add yet another field to
magazine_el_t, which increases the lower bound of memory-sizes (from 8 to 12,
or in the case of 64bit pointers, from 16 to 20).
Note that the use of constructed objects requires external magazines. In my
previous implementation, I held an upper-bound of 10 items / magazine, which
meant 48 bytes / magazine. An interesting ramification is that an object
free periodically requires an alloc from the 48-byte cache (which thankfully
isn't itself constructed, since otherwise this would quickly become recursive). The
implementation only kept a single free magazine in the depot so as not to
waste them (since all constructed object-caches need them).
Piecing things together:
With page-size = 1024, there are 12 dq-cache sizes: 8, 12, 16, 24, 32, 48,
64, 96, 128, 192, 256, 384 (a power-of-2/3 progression). Each thread needs a
thread-local pointer to the appropriate caches. As before, depot_alloc /
depot_free access this thread-local storage. Thus when vmem wants to access
these dq_caches, it uses the depot_xx functions. I _could_ supply a vmem
function variant which is given a pointer to the structure of thread-specific
caches, which would avoid this semi-expensive indirection.
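A sketch of mapping a request size to one of those twelve classes (a simple
table plus a linear scan; a real implementation would probably compute the
index from the size bits):

static const int g_dq_sizes[ 12 ] =
    { 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384 };

static int dq_index( size_t size )
{
    int i;
    for (i = 0; i < 12; i++)
        if (size <= (size_t) g_dq_sizes[ i ])
            return i;
    return -1;     /* larger requests fall through to the page-level allocator */
}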
For small memory requests the flow goes as follows:
vmem_alloc(sz) -> depot_alloc(..) -> cache_alloc(..) [ -> ( access depot ) [
-> (access slab) [ -> vmem_alloc(slab->num_pages) [ -> mmap(..) ] ] ] ]
Where [..] are successively rarer (though periodic) occurrences. The depot
and slab access occurs without additional function calls.
Pure object allocations skip the first one or two steps.
For medium sized allocations the flow is (including the above use of vmem by
slab):
vmem_alloc(sz) -> (lock g_seg_lock) -> (traverse free-list) [ -> (traverse
available-free-lists to find a free-list) [ -> seg = mmap(..) -> (init
segment) -> ( add seg to seg_hash) ] ] [ -> split free chunk into two
smaller chunks ] -> (allocate mem-object)
For large allocations:
vmem_alloc(sz) -> (lock g_mm_seg_lock) -> mmap(..) -> (add ptr to mm_seg_hash)
For small frees:
vmem_free(ptr) -> depot_free(..) -> cache_free(..) [ -> (access depot) [ ->
(access slab) [ -> vmem_free(slab) [ -> munmap(..) ] ] ] ]
For medium frees:
vmem_free(ptr) -> (lock g_seg_lock) -> (traverse g_seg_hash to find
associated seg) -> (consolidate free chain) -> [ ((remove seg from
g_seg_hash) -> munmap(seg) ) OR ((attach ptr to free-list) [ -> add
free-list to g_available_free_lists ] ) ]
For large frees:
vmem_free(ptr) -> (lock g_mm_seg_lock) -> (traverse g_mm_seg_hash) -> (delete
hash entry) -> munmap(...)
Note that vmem does not lock until it has to. Most notably vmem -> cache ->
vmem only locks on the second invocation of vmem, since the first invocation
only analyzes the passed parameters.
One interesting thing to note: there's no underlying need to use mmap as the
back-end. My previous prototype just used malloc / free for the backend.
One disadvantage of doing this is the double memory initialization. On the
other hand, it can dramatically reduce OS calls (to mmap) if the underlying
memory manager can handle 256K+ allocations in a non-blocking manner. (GNU
malloc does a good job of not blocking.)
There are two obvious shortcomings. First, with dq_cache size 8, there are
124 objects per slab. So long as a single object remains un-freed (perhaps
dormant in an idle thread's magazine-cache, beyond the reach of a reclaimer),
the slab cannot be reclaimed to vmem. It's possible to allocate 1 permanent
then 123 temporary objects, then 1 permanent, ad infinitum. When all the
temporaries are freed (possibly several meg worth), the permanent allocations
prevent any slabs from being freed (for reuse by other memory sizes, though
differing objects of the same size can still reuse them). The only way to
compensate for this is to use fewer objects per slab. Unfortunately, if we
used a slab size smaller than a page, we'd violate our anti-fragmentation
policy (inherent in SUN's design). Alternatively, we could reduce the page
size (to 512 bytes) at the expense of fewer dq_caches and a higher number of
mmaps (128K segments and a 64K upper bound prior to per-alloc mmaps). On
Linux, at least, this could be a heavy burden since its VM mappings use
linear linked lists (instead of Solaris's vmem-based real-time system).
The second shortcoming is that vmem has only two locks, no matter how many
threads are running. Thankfully this is only a problem for allocations with
sizes between page_size and max_pages * page_size. If allocations in that
range are known to be popular, a cache should be configured for the
appropriate sizes. There is one data type in particular, however, that
thwarts this: the dynamic array (and the associated hash-bucket array). This
doubles in size on each growth, and will thus traverse several mid-range
sizes (1, 2, 4, 8, 16, 32, 64, and 128 pages). To make up for this, it's
entirely possible that reallocs can avoid memory copies in this central
region, thereby reducing realloc compute time. Additionally, if more memory
is needed, the segment lock can be released while performing garbage
collection (traversing the depot list for reclaiming, locking each depot as
it goes).
In my previous experimentation, there was a measurable performance advantage
for direct object-cache calls over indirect vmem calls, and an even bigger
hit when caches used locks (which would allow a reclaimer to flush magazine
stack values). Thus I still maintain that caches should not use locks, and
that Parrot's interpreter should directly utilize cache objects for all known
fixed-size objects (which unfortunately excludes dynamic arrays).
This is a lot of stuff and is very verbal (light on actual code) so it's
probably hard to read. But it's at least on record for me to reference.
This was all information I had on my white-board.
Enjoy
-Michael