This references the previously posted (by someone else) article on SUN's libu
memory architecture. If needed I can run through the archives and find the
URL.
I've already written a threaded implementation which has decent
responsiveness. It did as much as SUN's libu claimed it would do.
Recently, however, I redesigned the architecture from scratch, taking
advantage of many possible optimizations. Since it's towards the end of the
day, I figured I'd document my architecture here and wait for comments before
beginning the rewrite / redebug.
As a refresher, SUN's design is based on 4 layers: vmem, slab, depot, and
magazine-cache.
At the bottom is vmem, which maps "pages" of memory (in libu, new pages are
acquired from mmap; excess are readily freed via munmap).
The next higher layer is the slab, which partitions a contiguous collection
of pages into an array of raw memory objects. (Objects are fixed-size
pieces of memory, so each object size requires a separate slab.) The slab
layer requests n pages from vmem to construct a new slab when it runs out of
memory objects. It also frees slabs back to vmem when all the individual
objects have been reclaimed. Both of these occur automatically w/o external
intervention.
The next higher layer is a "depot", which stores "constructed" objects; e.g.
objects that are initialized and potentially contain references to other
memory objects. When memory requests are made and no free objects exist,
requests are passed down to the associated slab (of corresponding size) and a
constructor is applied. This happens automatically. Free's to the slab are
asynchronous; they occur when the entire system is memory starved, or when a
depot has too many free objects. In any case, prior to a free to the slab,
the destructor is applied to each freed memory object.
The highest level is the magazine-cache. The sole function of the cache is
to give multiple threads thread-local access to the depot. It
does this by caching up to 2*M objects in a thread-local magazine structure.
On an alloc, when the magazine is empty, thread-synchronized access is
performed to the underlying depot / slab / vmem layers. On a free, when the
magazine is full (contains 2*M objects), thread-synchronized access to lower
layers is performed. Most of the time there is a mixture of allocs and
frees, which avoids lower-level accesses (and thus mutex locks). Aside from
alleviating sync contention, this minimizes the sharing of memory objects
between threads, and on multi-CPU systems this avoids cache pollution (due
to cache lines being forced into the MESI shared state).
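For concreteness, the per-thread fast path of that scheme looks roughly like
this. It is only a sketch with illustrative names (MAG_SIZE, thread_cache_t,
mag_cache_alloc, depot_alloc_slow); it is not SUN's actual API:

#define MAG_SIZE 8                       /* "M" above; illustrative value            */

typedef struct {
    int   rounds;                        /* objects currently held in this magazine  */
    void* objs[ MAG_SIZE ];
} magazine_t;

typedef struct {
    magazine_t* loaded;                  /* active magazine                          */
    magazine_t* previous;                /* second magazine; together they hold 2*M  */
} thread_cache_t;

void* depot_alloc_slow( thread_cache_t* tc );   /* locks the depot / slab / vmem layers */

void* mag_cache_alloc( thread_cache_t* tc )
{
    if (tc->loaded->rounds > 0)                          /* common case: no locking */
        return tc->loaded->objs[ --tc->loaded->rounds ];

    if (tc->previous->rounds > 0) {                      /* swap magazines and retry */
        magazine_t* tmp = tc->loaded;
        tc->loaded   = tc->previous;
        tc->previous = tmp;
        return tc->loaded->objs[ --tc->loaded->rounds ];
    }
    /* both magazines empty: thread-synchronized access to the lower layers */
    return depot_alloc_slow( tc );
}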
So far, nothing revolutionary, albeit clever on SUN's part. The next
attribute, however, is the use of external boundary tags. The idea is that
the allocator should never write into the allocated memory itself (SUN's
design goals included support for read-only allocations). Further, these
external tags don't have to be contiguous with the memory they describe.
Runs of contiguous memory regions are collected together in a segment.
Segments are allocation regions (such as via mmap). Within a segment,
consolidation may occur. When fully consolidated, the segment is freed.
Hashtables are used to associate a user's memory pointer with these external
boundary tags. From this, SUN claims a constant (bounded) alloc/free time no
matter what is going on. From my analysis, the only possible wrinkle in this
constant time is the reclaiming stage (which makes n reclaims, where n is the
number of depots, a relatively low and constant number).
At run-time, "object-caches" can be defined for arbitrarily sized objects.
Constructors and destructors can optionally be configured. Such object
references use the magazine layer interface. For generic memory access, vmem
is the API, but small memory requests are redirected to one of several
predefined "quantum caches" (of sizes 1 page, 2 pages, ..., q_max pages), which
alleviates lock contention on vmem when it is used generically.
SUN addressed the memory fragmentation problem by placing a lower bound on
the size of a slab's memory requests: slabs allocate at least 3 times the
object size, and in most cases a slab is 16 - 32 times the object size.
Further, with q-caches, all allocations up to q_max have slabs of the same
size. Thus slabs used by 8- and 16-byte object allocations can be trivially
intermixed. The only danger is if you never free all the objects w/in each
slab. I call this a shotgun effect: little bits are scattered throughout the
memory space, preventing slabs from being freed, and thus preventing slabs
from being interchanged when needed.
My design is as follows:
sbrk is completely avoided, which leaves the possibility of having two
separate memory managers. The following functions are public:
void* vm_alloc(size)
void vm_free(void*)
void* vm_realloc(void*,size)
int vm_sizeof(void*)
vm_dump_stats()
// automatically allocates a slab (reuses one if possible)
depot_t* depot_create( ... )
int depot_size( depot_t*)
void depot_reclaim( depot_t* ) // called internally or by GC
// cache_create also stores the pointer in thread-local-storage
// relative to the depot
cache_t* cache_create( depot_t*, ... )
// Note objects are of fixed size..
void* cache_alloc(cache_t*)
void cache_free(cache_t*, void*)
// these two functions pull the cache_t pointers from thread-local-storage
void* depot_alloc(depot_t*)
void depot_free(depot_t*, void*)
Compilation is contingent upon a MULTITHREAD preprocessor flag. With this
flag disabled, all mutexes are obviously avoided, but the entire
magazine-cache structure is also avoided; depots are directly accessed by the
object-caches. All the other features are at least partially beneficial
even in single-threaded mode.
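A sketch of how I picture that collapse, using the depot_t / cache_t
structures defined further down (tls_key and depot_alloc_direct are made-up
names here, not part of the API above):

void* depot_alloc( depot_t* d )
{
#ifdef MULTITHREAD
    /* multithreaded build: find this thread's cache for the depot and take
     * the lock-free magazine path (tls_key is a made-up field name) */
    cache_t* c = (cache_t*) pthread_getspecific( d->tls_key );
    return cache_alloc( c );
#else
    /* single-threaded build: no magazine-cache layer, no mutexes; go
     * straight to the depot's free objects (depot_alloc_direct is made up) */
    return depot_alloc_direct( d );
#endif
}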
My initial impression is that a page should be 1024 bytes (though I'm also
looking at 512 bytes to reduce the shotgun effect of 8-byte allocs).
I've actually found that "q-caches" were detrimental to things like
consolidation and the calling depth (overhead) of a worst-case allocation.
So I'm currently leaving them out. Instead I'm using a power-of-two based
dq-cache (for "differential q-cache").
SUN's q-caches are multiples of the page size up to q_max, so there'd be
q_max q-caches (usually around 5). What I'm calling dq-caches are
subdivisions of a single page. My understanding is that SUN's vmem assumed
all sub-page allocations would be via object-caches, or via a vmem configured
with some incredibly small page size (like 4 bytes), but that requires many
bytes of overhead for each allocation, and I know that we'll need a lot of
sub-32-byte allocations; thus this was unacceptable to me. Hence my use of
very large page sizes and the use of dq-caches. This was my first real
deviation from SUN's proposal.
One thing I was conscious of was memory alignment. SUN uses a separate, more
complex function in place of the simple vmem_xxx calls for arbitrary alignments. Further,
their OS version of vmem requires a size parameter to all vmem functions.
Thus you'd have to remember an allocation size when freeing or reallocating
(vmem_sizeof would be paradoxical). Their paper didn't say what they did
with libu, but I'm assuming they simply prepended 4 bytes to an allocation
which contained the memory size. I didn't want that, since it adds too
much overhead to small allocations AND further frustrates memory-aligned
allocations. I address the alignment problem by requiring that ALL vmem
allocations are aligned on page boundaries. This trivializes alignment for
sub-page allocations, and there are reasonably sophisticated ways of doing
alignment for multiples of a page size. For those alignments that match the
memory size, dq-caches are already aligned. For bizarre alignments (still
below a page) the size is rounded up to a page (thus using slack space). For
sizes larger than a page, the size is rounded up to a whole number of pages;
2*num_pages are allocated, then an offset is calculated to make the memory
aligned. All prior and subsequent pages are then freed. Since a page is an
extremely useful size (used by ALL dq-caches), no memory is wasted (aside
from up to half a page of slack space on average, which might be recovered
via a vm_realloc anyway).
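A sketch of that over-allocate-and-trim step (vm_free_range is a hypothetical
helper that hands whole pages back to the segment free lists; the sketch also
assumes the alignment has already been folded into the rounded-up size, i.e.
align <= npages * PAGE_SIZE):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 1024

void* vm_alloc( size_t size );                        /* public API from above           */
void  vm_free_range( void* start, size_t bytes );     /* hypothetical page-run release   */

void* vm_alloc_aligned( size_t size, size_t align )   /* align: multiple of PAGE_SIZE    */
{
    size_t npages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    char*  raw    = (char*) vm_alloc( 2 * npages * PAGE_SIZE );   /* over-allocate       */
    if (!raw)
        return 0;

    char* aligned = (char*)(((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1));

    /* hand the unused leading and trailing pages back; external boundary tags
     * make this possible without ever writing into the allocation itself */
    if (aligned != raw)
        vm_free_range( raw, (size_t)(aligned - raw) );
    vm_free_range( aligned + npages * PAGE_SIZE,
                   (size_t)((raw + 2 * npages * PAGE_SIZE) - (aligned + npages * PAGE_SIZE)) );
    return aligned;
}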
As with most memory managers, there is an upper bound on memory sizes beyond
which mmap is exclusively utilized. At the moment, I'm leaning towards 128K.
I'm debating using a power-of-two round-up on such allocations to remove
some burden from the OS.
Now comes the freaky part. SUN utilizes a hash for ALL non-explicit
object memory allocations. From this they can deduce whether a free goes to
their q-cache, the regular paging system, or munmap. Obviously an
overabundance of 8-byte allocs could bring any hash-table to its knees
(including violating SUN's claim of real-time access). I'm assuming that by
requiring the memory size even on a free, vmem can use multiple hash-tables,
one for each size. But I don't require the memory size to be passed in, NOR
do I prepend allocations with the memory size. The only thing I have left is
memory alignment. I assume that explicit object allocations never use vmem,
and that implicit object caches (via dq-cache) are _always_ offset from a
page-aligned boundary (revisited below). Further, special user
memory-alignment requirements should either use explicit object-caches or a
separate vmemx_(free|alloc|realloc), as with SUN. Now the only remaining
ambiguity is between mmap'd and page-mapped allocations. For this, I assume
that mmap frees are very rare (since you can't have a million allocations of
size >= 128K), so they are relegated to a slower path.
vmem_free(void* ptr):
    ptr_1k = align1k( ptr )
    if ptr_1k != ptr:                we're in a dq-cache, pass to it (described later)
    else if ptr_1k in segment-hash:  we're paged
    else if ptr_1k in mmap-hash:     we're mmap'd
    else:                            throw exception "invalid mem reference"
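In rough C, using the structures defined below (the dq_cache_free /
*_hash_lookup / *_free helpers are placeholders for the hash walks described
later):

#include <stdint.h>
#include <stdlib.h>

void vmem_free( void* ptr )
{
    void* ptr_1k = (void*)((uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1));

    if (ptr_1k != ptr)
        dq_cache_free( ptr );               /* not page aligned: came from a dq-cache */
    else if (page_seg_hash_lookup( ptr_1k ))
        page_seg_free( ptr_1k );            /* normal paged allocation                */
    else if (mm_seg_hash_lookup( ptr_1k ))
        mm_seg_free( ptr_1k );              /* direct mmap allocation                 */
    else
        abort();                            /* "invalid mem reference"                */
}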
Thus I utilize two separate hashes: one for normal allocations (size <=
128K), and one for larger allocations that require mmap. Note that the
hash-table data-structure is _completely_ inlined with the data segments; no
mini-mallocs are required. This complicates the use of mmap on aligned
memory requests, but those are rare enough to warrant the extra processing.
The structures are as follows:
#define NUM_PAGES_PER_SEGMENT 255 // possibly 127
#define PAGE_SIZE 1024
#define SEGMENT_SIZE ALIGN_1K( NUM_PAGES_PER_SEGMENT * PAGE_SIZE + sizeof(page_seg_t) )
/*
 * mmap allocation
 */
typedef struct mm_seg_t {
    void* address;                  // hash-unique identifier; contains the address of &data[0]
    struct mm_seg_t *next, *prev;   // hash doubly-linked-list elements
    int   size;                     // for use in realloc
    void* data[0];                  // beginning of data
} mm_seg_t;

mm_seg_t        g_mm_seg_hash[ 128 ];
pthread_mutex_t g_mm_seg_hash_lock;
/*
 * segment allocation
 */
typedef struct _page_seg_hash_t {
    // address is implicit for segments
    struct _page_seg_hash_t *next, *prev;   // hash doubly-linked-list elements
} page_seg_hash_t;

typedef struct {
    unsigned char prev_tag_size, tag_size, f_free;
} tag_t;

typedef struct _page_seg_t {
    page_seg_hash_t hash_header;    // used by the hashing algorithms
    // segment size is constant and thus implied
    int   free_pages;               // allows a quick check of whether to free the segment
    tag_t boundary_tags[ NUM_PAGES_PER_SEGMENT + 2 ];   // one boundary tag per page, plus
                                    // 2 delimiting (begin/end) boundary tags
    void* data[0];                  // physical location of pages
} page_seg_t;

page_seg_hash_t g_page_seg_hash[ 128 ];
pthread_mutex_t g_page_seg_hash_lock;
/*
 * free pointer chains
 */
typedef struct _page_free_t {
    struct _page_free_t *next, *prev;
    page_seg_t* header;
} page_free_t;

typedef struct _available_page_free_t {
    struct _available_page_free_t *next, *prev;
    int chain_size;                 // index into g_free_chains
} available_page_free_t;

page_free_t           g_free_chains[ NUM_PAGES_PER_SEGMENT ];
available_page_free_t g_available_free_head;
From the above, you can see the embedding of the hash data structures.
Segments are directly allocated from mmap as needed (no pre-allocations at
process init), as in:
page_seg_t* seg = mmap(0, SEGMENT_SIZE, .... );
Tags are implicitly linked to pages. Offsets within seg->boundary_tags[]
correspond directly to PAGE_SIZE offsets into seg->data[], and vice versa.
If NUM_PAGES_PER_SEGMENT is <= 256, then a single byte can be used for
relative addresses. Thus segments max out at 256K. Additionally, one page
is left unused (reserved for the header), so that we don't acquire 256K +
sizeof(header). The first and last boundary tags are set to "allocated" so as
to avoid consolidating beyond the edges of the segment.
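The implicit linkage is just index arithmetic; a sketch, assuming tag 0 is
the leading sentinel so page i maps to tag i + 1:

static tag_t* page_to_tag( page_seg_t* seg, void* page )
{
    size_t i = (size_t)((char*)page - (char*)seg->data) / PAGE_SIZE;
    return &seg->boundary_tags[ i + 1 ];
}

static void* tag_to_page( page_seg_t* seg, tag_t* tag )
{
    size_t i = (size_t)(tag - seg->boundary_tags) - 1;
    return (char*)seg->data + i * PAGE_SIZE;
}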
Allocation of a segment requires initialization of the boundary_tags array
(either 129 or 257 entries), as well as insertion into the segment
hash-table.
The mmap hash-table and segment hash-table have separate mutexes since they
are independent.
I'm looking to make the hashtables statically sized, with 128 buckets. Thus
640 meg worth of 256-page segments (@ 1k / page) evenly distributed across
the hash buckets gives about 20 levels of linked-list lookup (in the worst
case there would be ~2560 lookups). The hash key currently is:
int key = (addr >> 18) & 127;
256K-aligned mmaps should be linearly assigned (at least initially), and thus
distribute pseudo-evenly throughout the hash buckets.
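A lookup sketch using that key, assuming each bucket is a circular list with
the bucket entry itself as the sentinel, hash_header is the first member of
page_seg_t, and (as noted above) segments end up 256K-aligned so every page
in a segment hashes to the same bucket:

static page_seg_t* page_seg_hash_lookup( void* page )
{
    int key = (int)(((uintptr_t)page >> 18) & 127);
    page_seg_hash_t* node = g_page_seg_hash[ key ].next;

    for ( ; node != &g_page_seg_hash[ key ]; node = node->next) {
        page_seg_t* seg  = (page_seg_t*) node;   /* hash_header is the first member */
        char*       base = (char*) seg->data;
        if ((char*)page >= base &&
            (char*)page <  base + NUM_PAGES_PER_SEGMENT * PAGE_SIZE)
            return seg;
    }
    return 0;                                    /* not a paged allocation */
}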
It's assumed that segment_size / 2 is the maximum practical allocation using
pages; above 128K, mmap is used directly for alloc/free.
With P pages in a segment, there are only P possible memory sizes. Thus for
allocation, P free-list chains are used. This avoids much of the trouble
associated with power-of-two free-chain-lists. The only problem is in
searching the free-chain lists. On initialization, there is a single 255
page free element. Allocating 1 page would require navigating 254 empty
pointer chains. But I use a doubly-linked list of non-empty
free-pointer-chains. Thus on failing to find any free page-chunks of size 1,
the next non-empty free-pointer-list is immediately found to be of size 255.
Granted, this requires starting at the head of the available-free list, so if
you're looking for page_size >= 250 and there are entries in most free
chains, you get poorer results (but it is at least bounded).
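A search sketch. It treats chain i as holding runs of exactly i + 1 pages and
chain_size as the run length; both lists are assumed to be circular with
sentinel heads, and splitting of the chosen run is omitted:

static page_free_t* find_free_run( int npages )
{
    /* exact fit first: chain npages - 1 holds runs of exactly npages pages */
    if (g_free_chains[ npages - 1 ].next != &g_free_chains[ npages - 1 ])
        return g_free_chains[ npages - 1 ].next;

    /* otherwise walk only the chains known to be non-empty */
    available_page_free_t* a = g_available_free_head.next;
    for ( ; a != &g_available_free_head; a = a->next)
        if (a->chain_size >= npages)
            return g_free_chains[ a->chain_size - 1 ].next;

    return 0;      /* nothing large enough: caller mmaps a new segment */
}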
The external boundary tags allow 1k-aligned memory requests with almost no
overhead. They also allow semi-efficient (constant-time) consolidation,
though since the segment is of a constant size I'm going to look into using a
buddy system as described in "Understanding the Linux Kernel". Though
faster, I believe the buddy system will be more wasteful of memory and less
inclined to "grow" memory on reallocs. I should be able to easily
interchange the two since they're such a small part of the vmem design.
vmem_init will have to initialize both hash-tables, and the free-mem list.
It will also have to initialize the dq-caches (below). Spawning new threads
might also require creating thread-local magazines for each dq-cache
(depending on whether a thread_id -> cpu_id mapping is possible, in which
case vmem_init would initialize one per CPU, though that requires locks in
the magazine-caches).
Before calling mmap to allocate a new segment, reclaim is applied to the list
of depots. This causes a speed hiccup, as well as leaving some important
caches starved, but I can't think of any other time this would be as
valuable. Perhaps if a separate thread's GC were used, but that's not
portable.
Though this is a mishmash, it should cover all the technology of the basic
vmem layer.
Next is the slab layer. Each slab has a centralized data structure which
contains various configuration info (alignment, object size, pointer chains
for both slabs and slabs-with-free-objects, etc.). I'm happy with my
previous work on the slab layer, so I'm not going to post it here, other than
to mention that each slab has a header. This header contains at least:
typedef struct _slab_t {
    int object_size;        // size of all allocations within this slab; used by vmem_free,
                            // vmem_sizeof, and vmem_realloc to identify the size of a void*.
                            // Note that this is redundant with slab_descriptor_t.
    struct _slab_t *next, *prev, *next_free, *prev_free;
    int   num_free;         // free objects in this slab (used to see when the slab is empty)
    void* first_free;       // used for slab allocation
    void* data[0];          // beginning of data
} slab_t;
typedef struct _slab_descriptor_t {
    pthread_mutex_t lock;
    int object_size;
    int num_pages;
    slab_t *next, *prev, *next_free, *prev_free;
    struct _slab_descriptor_t *next_desc, *prev_desc;
    ...                     // various precalculated info for more efficient operation
} slab_descriptor_t;

slab_descriptor_t g_slab_descriptor_list_head;
For slab_size <= page_size, the presence of the prefixing header means that
no object address can sit on a page boundary. This fact is exploited by
vmem_free(void*), vmem_sizeof(void*), and vmem_realloc(void*, new_size): they
can skip navigation of g_page_seg_hash entirely, go to the page boundary, and
read the object size from the slab header. Since the existence of an
external pointer means that the slab is not ready to be freed, no locking
need occur for such a lookup. However, in order to allocate from or free to
the slab, the slab_descriptor must be locked. As soon as all of a slab's
objects have been freed (num_free reaches the slab's total object count), the
slab is handed back to vmem_free.
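For example, vmem_sizeof on a dq-cache pointer reduces to roughly the
following (vmem_sizeof_paged is a hypothetical fallback into the segment /
mmap hashes):

int vmem_sizeof( void* ptr )
{
    if (((uintptr_t)ptr & (PAGE_SIZE - 1)) != 0) {
        /* dq-cache object: it can never sit on a page boundary, so rounding
         * down lands on the slab_t header at the start of its one-page slab */
        slab_t* slab = (slab_t*)((uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1));
        return slab->object_size;          /* no locks taken */
    }
    return vmem_sizeof_paged( ptr );       /* page aligned: consult the hashes */
}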
Each newly created slab is inserted into the slab-descriptor-list which
allows reuse by subsequent internal calls to slab_create(..).
The next layer is the depot. Though SUN outlined a constructor / destructor
policy, the most common case (used by the dq_caches at least) is volatile
objects with no constructor/destructor. Since we're not storing valid
objects, we are free to overwrite their contents with free-pointer chains (as
do the slab and vmem subsystems). Note that this is only a special case,
which warrants a separate set of handlers; my previous implementation adhered
to SUN's design, and I've renamed those handlers obj_magazine_XXX and
obj_depot_XXX. They have no use w/in the dq_caches, and their requirements
impede efficiency. Thus my simplified depot design requires a lower-bound
object size of sizeof(void*) * 2, since multiple free pointers are stored in
the object. SUN's design utilized a separately allocated structure called a
magazine, which was a mini-stack. My design uses the objects themselves as
linked-list stacks. The magazines are chained together (in the depot) by a
second linked list stored in the top-most element of each stack, as with:
typedef struct _magazine_el_t {
    struct _magazine_el_t* next_el;
    struct _magazine_el_t* next_stack;
} magazine_el_t;

typedef union _object_t {
    magazine_el_t el;
    void* data[0];
} object_t;

typedef struct _depot_t {
    object_t* full_magazines;
    int max_stack_size;       // increased after contentions for depot locks
    int max_num_magazines;    // max number of magazines to hold before freeing to slab
    int num_free_objects;
    pthread_mutex_t lock;
    struct _depot_t *next_depot, *prev_depot;   // for g_depot_list
    ...                       // other items from my previous implementation
} depot_t;

depot_t g_depot_list_head;
From this there is no need to maintain two lists (one of filled magazines and
one of empty magazines), since there is no concept of an empty magazine.
This helps performance since you don't have to allocate magazine structures
(which in my previous implementation hurt "free(..)" performance as well as
memory usage). All access to the depot is thread-synchronized. But
since each memory-request-size maps to a different depot, there is a smaller
likelihood of contention. Contention is further addressed by modifying
"max_stack_size" after a failed "try_lock" on the depot. It is reduced when
"reclaim" calls are made. I haven't decided how to deal with
"max_num_magazines", since theoretically this is handled by reclaim.
Lastly the magazine-cache layer uses:
typedef struct _cache_t {
    int num_data_objects;
    int num_prev_objects;
    object_t *data, *prev;
    depot_t*  depot;
} cache_t;
This is pretty close to the SUN architecture except that we're not using
magazines, but instead use the data objects themselves as linked-list stacks.
When depot->max_stack_size cache_free calls occur, data and prev are
swapped. If "data" is still too full, it is placed into the depot (after
locking). Instead of allocating an empty magazine, data is reset to NULL and
num_data_objects is reset to zero.
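In code, the free path is roughly as follows (depot_put_stack is the
hypothetical helper sketched above):

void cache_free( cache_t* c, void* obj )
{
    object_t* o = (object_t*) obj;

    if (c->num_data_objects >= c->depot->max_stack_size) {
        /* "data" is full: swap data and prev ... */
        object_t* tmp_p = c->data;    int tmp_n = c->num_data_objects;
        c->data = c->prev;            c->num_data_objects = c->num_prev_objects;
        c->prev = tmp_p;              c->num_prev_objects = tmp_n;

        /* ... and if the new "data" is still full, hand it to the depot */
        if (c->num_data_objects >= c->depot->max_stack_size) {
            depot_put_stack( c->depot, c->data, c->num_data_objects );
            c->data             = 0;
            c->num_data_objects = 0;
        }
    }
    /* common case: push onto the thread-local stack, no locks taken */
    o->el.next_el = c->data ? &c->data->el : 0;
    c->data = o;
    c->num_data_objects++;
}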
I haven't completely thought out the shared readonly use of
depot->max_stack_size. I'm inclined to fix it to 5 for the time being (since
I don't know how bad it is to read from this value as another thread is
potentially over-writing it). I might just pull a volatile copy of it into
cache_t, and have it updated on every access to depot. Additionally, if the
stack-size is dynamic, then I'll have to add yet another field to
magazine_el_t, which increases the lower bound of memory-sizes (from 8 to 12,
or in the case of 64bit pointers, from 16 to 20).
Note that the use of constructed objects requires external magazines. In my
previous implementation, I held an upper-bound of 10 items / magazine, which
meant 48 bytes / magazine. An interesting ramification is that an object
free periodically requires an alloc from the 48-byte cache (which thankfully
isn't itself constructed, since otherwise this would quickly become recursive). The
implementation only kept a single free magazine in the depot so as not to
waste them (since all constructed object-caches need them).
Piecing things together:
With page-size = 1024, there are 12 dq-cache sizes: 8, 12, 16, 24, 32, 48,
64, 96, 128, 192, 256, 384 (a power-of-2/3 progression). Each thread needs a
thread-local pointer to the appropriate caches. As before, depot_alloc /
depot_free access this thread-local storage. Thus when vmem wants to access
these dq_caches, it uses the depot_xx functions. I _could_ supply a vmem
function variant which is given a pointer to the structure of thread-specific
caches, which would avoid this semi-expensive indirection.
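A sketch of mapping a request size to one of those twelve classes (a simple
table plus a linear scan; a real implementation would probably compute the
index from the size bits):

static const int g_dq_sizes[ 12 ] =
    { 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384 };

static int dq_index( size_t size )
{
    int i;
    for (i = 0; i < 12; i++)
        if (size <= (size_t) g_dq_sizes[ i ])
            return i;
    return -1;     /* larger requests fall through to the page-level allocator */
}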
For small memory requests the flow goes as follows:
vmem_alloc(sz) -> depot_alloc(..) -> cache_alloc(..) [ -> ( access depot ) [
-> (access slab) [ -> vmem_alloc(slab->num_pages) [ -> mmap(..) ] ] ] ]
Where [..] are successively rarer (though periodic) occurrences. The depot
and slab access occurs without additional function calls.
Pure object allocations skip the first one or two steps.
For medium sized allocations the flow is (including the above use of vmem by
slab):
vmem_alloc(sz) -> (lock g_seg_lock) -> (traverse free-list) [ -> (traverse
available-free-lists to find a free-list) [ -> seg = mmap(..) -> (init
segment) -> ( add seg to seg_hash) ] ] [ -> split free chunk into two
smaller chunks ] -> (allocate mem-object)
For large allocations:
vmem_alloc(sz) -> (lock g_mm_seg_lock) -> mmap(..) -> (add ptr to mm_seg_hash)
For small frees:
vmem_free(ptr) -> depot_free(..) -> cache_free(..) [ -> (access depot) [ ->
(access slab) [ -> vmem_free(slab) [ -> munmap(..) ] ] ] ]
For medium frees:
vmem_free(ptr) -> (lock g_seg_lock) -> (traverse g_seg_hash to find
associated seg) -> (consolidate free chain) -> [ ((remove seg from
g_seg_hash) -> munmap(seg) ) OR ((attach ptr to free-list) [ -> add
free-list to g_available_free_lists ] ) ]
For large frees:
vmem_free(ptr) -> (lock g_mm_seg_lock) -> (traverse g_mm_seg_hash) -> (delete
hash entry) -> munmap(...)
Note that vmem does not lock until it has to. Most notably vmem -> cache ->
vmem only locks on the second invocation of vmem, since the first invocation
only analyzes the passed parameters.
One interesting thing to note: there's no underlying need to use mmap as the
back-end. My previous prototype just used malloc / free for the backend.
One disadvantage of doing this is the double memory initialization. On the
other hand, it can dramatically reduce OS calls (to mmap) if the underlying
memory manager can handle 256K+ allocations in a non-blocking manner. (GNU
malloc does a good job of not blocking.)
There are two obvious shortcomings. First, with dq_cache size 8, there are
124 objects per slab. So long as a single object remains un-freed (perhaps
dormant in an idle thread's magazine-cache, beyond the reach of a reclaimer),
the slab cannot be reclaimed to vmem. It's possible to allocate 1 permanent
then 123 temporary objects, then 1 permanent, ad infinitum. When all the
temporaries are freed (possibly several meg worth), the permanent allocations
prevent any slabs from being freed (for reuse by other memory sizes, though
differing objects of the same size can still reuse them). The only way to
compensate for this is to use fewer objects per slab. Unfortunately, if we
used a slab size smaller than a page, we'd violate our anti-fragmentation
policy (inherent in SUN's design). Alternatively, we could reduce the page
size (to 512 bytes) at the expense of fewer dq_caches and a higher number of
mmaps (128K segments and a 64K upper bound prior to per-alloc mmaps). On
Linux, at least, this could be a heavy burden since its VM mappings use
linear linked lists (instead of Solaris's vmem-based real-time system).
The second shortcoming is that vmem has only two locks, no matter how many
threads are running. Thankfully this is only a problem for allocations with
sizes between page_size and max_pages * page_size. If allocations in that
range are known to be popular, a cache should be configured for the
appropriate sizes. There is one data type in particular, however, that
thwarts this: the dynamic array (and the associated hash-bucket array). This
doubles in size on each growth, and will thus traverse several mid-range
sizes (1, 2, 4, 8, 16, 32, 64, and 128 pages). To make up for this, it's
entirely possible that reallocs can avoid memory copies in this central
region, thereby reducing realloc compute time. Additionally, if more memory
is needed, the segment lock can be released while performing garbage
collection (traversing the depot list for reclaiming, locking each depot as
it goes).
In my previous experimentation, there was a measurable performance advantage
for direct object-cache calls over indirect vmem calls, and an even bigger
hit when caches used locks (which would allow a reclaimer to flush magazine
stack values). Thus I still maintain that caches should not use locks, and
that Parrot's interpreter should directly utilize cache objects for all known
fixed-size objects (which unfortunately excludes dynamic arrays).
This is a lot of stuff and is very verbal (light on actual code) so it's
probably hard to read. But it's at least on record for me to reference.
This was all information I had on my white-board.
Enjoy
-Michael