Uri Guttman wrote:
> >>>>> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
>
> <off list>
>
>   DS> Ask has found us a spot for the paper Alan was speaking of.
>   DS> http://dev.perl.org/perl6/talks/
>
> very impressive paper.  my recent proposal for a multi-queue malloc
> system has some of the ideas of this but nowhere nearly as fully
> developed.  if we can drop in the libumem implementation directly, i
> think it will be a major win.  it doesn't look too difficult for us to
> write our own version (with some parrot centric features maybe).  the
> thread (vs. cpu) support looks very good too.  i like how they get
> constant or linear time results.
>
> uri
I was fascinated by the concepts and implications, so I wrote several test utilities on a 466MHz dual-processor Celeron running Red Hat Linux 7.1. The paper shows a benchmark based on a 10-CPU Solaris machine that is a large loop of allocs and frees (of an unspecified size, assumed to be 4 bytes). The paper shows single-threaded throughput of 0.5 mega-cycles/second growing to an upper bound of 3 Mc/second. I ran several combinations of a simple loop of:

    thr_main:
        for i .. size:
            ptr = alloc(4)
            *ptr = 1
            free(ptr)

    main:
        for i .. num_threads:
            pthread_create( thr_main )

The first thing to note is that using glibc's non-MT malloc I achieved 2.67 mega-cycles/second. Using the perl5 memory allocator (non-MT), I achieved 8.4 mega-cycles/second. Once pthreads were used (by uncommenting the thread-specific code and linking -lpthread), throughput fell to 0.978M/second for both perl5 and glibc. Perhaps I'm not successfully mapping alloc to perl - I tried various alternative names for the allocator (such as New, Perl_xxx, etc).

I obviously didn't have libumem to play with, but I did want to see what the lower bound was for the various operations, and I found some strange results. First I wrote an unlocked void* MyAlloc(int), void MyFree(void*) pair which simply used a stack for allocation. It wasn't a valid memory allocator, since it assumes that you free in reverse order, but it sufficed. It achieved 20 million cycles/second. Next I added mutexes, which slowed the single thread down to 1.3 mega-cycles/second.

The next thing I did was to split resources into separate arrays which are accessed via a hashing of the thread id. This required (int)pthread_self(). In the existing implementation pthread_t is just an int, but casting it doesn't give me warm fuzzies (how portable could that be?). Strangely enough, Linux thread ids are wildly jumping numbers (starting at over a thousand, with at least a thousand numeric values separating each). The hash was a simple masking of the rightmost several bits (3 in my test case, since I wasn't going to test more than 10 simultaneous threads).

Very quickly I found cache-line contention to be a problem (just as the paper suggested). Since I was using separate arrays of integers, the cache lines were hopelessly shared. My interim solution was to multiply the thread id by some value (8 or 16 work well), at the expense of unused array elements. The following are some benchmarks (the ones I can still read from my scratch sheets):

    Num threads | run time, mult = 16 (pad) | run time, mult = 2 | mutexes, mult = 16
    1           | 1.52s                     | 1.52s              | 8.05s
    2           | 8.25s                     | 8.6s               | 16.14s
    3           | 5.6s                      | 12.99s             | 16.18s
    4           | 3.86s                     | 17.23s             | 21s
    5           | 5.04s                     |                    | 23.19s
    6           | 6.7s                      |                    |
    7           | 6.64s                     |                    |

One comment: the left-most data set has some wild swings; the more common occurrence was something like:

    1 | 1.52s | 6.4 Mc/s
    2 | 8s    | 2.5 Mc/s
    3 | 5s    | 6 Mc/s
    4 | 5.5s  | 7.2 Mc/s
    5 | 6s    | 8.3 Mc/s
    6 | 6.5s  | ...

The above are off the top of my head, based on the types of numbers I've seen after hundreds of tweaks and unrecorded runs. Two things seemed true. First, the more threads that were used, the more efficient it became. Second, the second thread always, without exception, provided the slowest / least-efficient operation. I can't explain this. It seemed impervious to cache-line-width padding. Note in the padding-times table that two threads took roughly the same time for both the 2x and 16x thread-id multipliers.
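For concreteness, here is a minimal sketch of the hashed, padded per-thread indexing I'm describing (the names per_thread_slot, HASH_BITS and PAD are mine; PAD plays the role of the thread-id multiplier in the table, and the constants are simply the values I happened to test):

    #include <pthread.h>

    #define HASH_BITS 3                     /* enough for <= 10 test threads   */
    #define HASH_MASK ((1 << HASH_BITS) - 1)
    #define PAD       16                    /* spread slots across cache lines */

    /* one int per slot, padded so two threads rarely share a cache line */
    static int counters[(HASH_MASK + 1) * PAD];

    /* map the (wildly jumping) Linux thread id onto a small slot index;
       the cast of pthread_self() is exactly the portability worry above */
    int *per_thread_slot(void)
    {
        int id = (int)pthread_self();
        return &counters[(id & HASH_MASK) * PAD];
    }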
With only a 1x or 2x multiplier, the run time for 2 threads seems to fall in line with the other thread counts. The 1-thread operation is obviously faster because there is no cache-line contention. I'm curious to learn whether this is a general problem with Linux, x86, the dual-processor Celerons, or something more subtle.

At this point I was using a two-level memory manager. The first level was a thread-specific magazine allocator. In this primitive memory system there was no need for the "data" / "previous" magazines described in the Sun paper; instead a single magazine array and top-of-magazine-stack index was allocated for each thread. The second layer was a slab allocator. Again, in this simple test no attempt is made to free slab elements, so a static, fixed-size stack was used for slab requests. Since slab requests are rare (in this example, only once per thread), and since multiple threads will call upon it, a slab-specific mutex was used.

I output the hash values of the thread ids (which were used as scaled indexes into the thread-specific magazine data structures). As could be expected, various numbers of threads produced overlapping hash values. This is dangerous since it means multiple threads will use the same magazine structures, which means that each thread MUST lock when entering even the magazine allocator / deallocator. For reference, the Sun paper says that mutexes are needed for multiple reasons; the stated example was that the dynamic magazine size must periodically be changed by a single thread for all CPUs (threads mapped to a subset of magazine structures). The paper justifies this overhead by saying that locking a CPU-specific structure that has little or no contention (in that context, the only contention arises when magazine sizes are changed) has trivial overhead. Unfortunately, in user space a mutex seems to require at least a function call at best, and at worst a system call. Now, I'm sure that Solaris has some nicely optimized user-threading routines (for get-thread-id and lock / unlock), but for the Linux pthread library, luck is not on our side (at least not unless there are some asm includes hidden somewhere).

Finally I designed my real target system. The goals were as follows (a rough sketch of the resulting interface follows the list):

o thread info is stored inside the parrot interpreter data structures (avoiding thread-specific data and the dreaded pthread_self() with its unreliable values)

o the magazine layer is handled completely within the interpreter data structures

o prevent locking on the common execution path (keeping the common case as fast as possible)

o allow global locking on less-used operations (such as depot, slab, vmem)

o the magazine allocator / deallocator requires that the thread/interpreter-specific magazine data structure be passed in (breaking malloc compatibility). This is justifiable since all parrot code can be required to accept a high-level parrot interpreter with little extra overhead.

o for backward / external compatibility, provide a stand-alone magazine structure that serves as single-threaded access to a traditional malloc. In other words, "char* malloc(int size)" is a wrapper for "char* Parrot_Malloc(g_shared_magazine, size)". The only reason this would be necessary is for mis-behaved XS code and/or linked C libraries. It's possible to apply a CPU-specific architecture for the special case of malloc (as with libumem), but that reserves a rather large amount of resources and requires a higher setup time for such a [hopefully] rarely used utility.
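A minimal sketch of what that interface could look like, with hypothetical details everywhere the goals above don't spell them out (the struct layout, the names Parrot_Interp, Parrot_Free and compat_malloc, and the stand-in slab_alloc are all mine; only Parrot_Malloc and g_shared_magazine come from the list):

    #include <stdlib.h>
    #include <pthread.h>

    /* hypothetical per-interpreter magazine; the real layout would differ */
    typedef struct {
        void          **mag;       /* magazine stack of cached objects */
        int             mag_top;   /* next free slot in the magazine   */
        int             mag_size;  /* capacity of the magazine         */
        pthread_mutex_t lck;       /* used only by the shared fallback */
    } Parrot_Interp;

    /* stand-in for the globally locked slab layer; for this sketch it
       simply falls through to the system malloc                        */
    static void *slab_alloc(size_t size) { return malloc(size); }

    /* common path: no locking, the interpreter is passed explicitly */
    void *Parrot_Malloc(Parrot_Interp *interp, size_t size)
    {
        if (interp->mag_top > 0)
            return interp->mag[--interp->mag_top];
        return slab_alloc(size);           /* rare, globally locked path */
    }

    void Parrot_Free(Parrot_Interp *interp, void *ptr)
    {
        if (interp->mag_top < interp->mag_size)
            interp->mag[interp->mag_top++] = ptr;
        else
            free(ptr);                     /* magazine full: hand back   */
    }

    /* backward compatibility: one shared, mutex-guarded magazine for
       external code that only knows the traditional malloc signature    */
    static void         *g_shared_slots[64];
    static Parrot_Interp  g_shared_magazine =
        { g_shared_slots, 0, 64, PTHREAD_MUTEX_INITIALIZER };

    void *compat_malloc(size_t size)
    {
        void *ptr;
        pthread_mutex_lock(&g_shared_magazine.lck);
        ptr = Parrot_Malloc(&g_shared_magazine, size);
        pthread_mutex_unlock(&g_shared_magazine.lck);
        return ptr;
    }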
A rough cut of this code (still using a primitive magazine and slab allocator) is attached. The numbers are pretty good:

    threads | run time | throughput
    1       | .66s     | 15 Mc/s
    2       | 3.7s     | 5.3 Mc/s
    3       | 2.54s    | 11.8 Mc/s
    4       | 2.87s    | 13 Mc/s
    5       | 3.09s    | 16 Mc/s
    6       | 3.18s    | 18 Mc/s
    10      | 4.73s    | 21 Mc/s

Note that the values fluctuate by 1.5 seconds or more for thread counts greater than 2 (not shown), while remaining fairly constant for the first two. It's not surprising that the theoretical max was achieved for a large enough set of threads, since a pair of alloc/free calls is simply two function calls, two array accesses and two index modifications - essentially a handful of assembly-language operations. Guesstimating this at a dozen or so instructions, that presumes roughly 0.25 ops / clock / CPU on my machine.

The point of this exercise was to explore a skeleton implementation on Linux with our current needs (the non-standard memory-manager interface with a back door for external code). I believe the comparison of the above benchmarks suggests that my stated set of goals at least warrants further investigation, and possibly implementation within Parrot code. The Solaris implementation does not perform efficiently when dropped directly onto the Linux platform with the existing pthreads library. It is assumed that libumem will only perform well on the Solaris platform, where it may stand as a compile candidate. I don't know if libumem is available on the latest-generation Solaris platforms, but I'll check when I have time. I am assuming that even with Solaris optimizations (with Solaris threads) the overhead of locking will hurt performance. The main saving grace is that libumem may provide a better POSIX-compliant memory manager (for functions without access to the per-thread interpreter).

Another key point I wanted to make is that the Solaris vmem architecture is slated to provide arbitrary memory alignment with zero slack space. Unfortunately, vmem strictly holds to external boundary tags, which buys memory / alignment purity at the cost of wasted memory space. In fact, the paper suggests that for user space vmem is foregone altogether, and that the slab allocator makes direct calls to sbrk / mmap, which in turn make use of vmem, albeit through a system call. If mmap were fast enough this might be feasible, but sbrk would require some other form of user-space memory mapper, and Linux might not like having thousands of memory maps (one for each 2+K memory request). Not to mention this might not be very portable to other OSes.

There are dozens of variations on the user-space vmem architecture, but one feature that I really like is that vmem requests seem to be partial to page boundaries (the kernel vmem likes to operate exclusively on groups of pages). A page is a very nicely aligned memory reference. This, coupled with the use of object caches (fixed-size memory requests that optionally have constructors / destructors applied at appropriate and efficient times), could provide all the alignment that an application could want.

The only issue I'm currently worried about is that if the interpreter houses the per-thread structures, then there will be an awful lot of them - consider the concept of quantum caches: 30+ fixed-size, dynamically instantiated power-of-2 caches, each of which would have to maintain references within each interpreter. By checking for a null pointer and instantiating each object cache on demand, parrot-interpreter initialization shouldn't be that bad, but parrot-interpreter destruction could be murder.
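To make the on-demand instantiation concrete, here is a minimal sketch (the names quantum_cache_t, cache_for_size and NUM_QCACHES are hypothetical, not Parrot code, and the cache body is elided):

    #include <stdlib.h>

    #define NUM_QCACHES 30                /* power-of-2 sizes: 8 .. 8<<29 */

    typedef struct quantum_cache {
        size_t obj_size;                  /* fixed object size served     */
        /* ... magazines, slab references, etc. would live here ...       */
    } quantum_cache_t;

    typedef struct interp {
        /* per-interpreter cache table, all NULL until first use          */
        quantum_cache_t *qcache[NUM_QCACHES];
    } interp_t;

    /* round a request up to its power-of-2 cache index (8 bytes = idx 0) */
    static int qcache_index(size_t size)
    {
        int    idx = 0;
        size_t q   = 8;
        while (q < size) { q <<= 1; idx++; }
        return idx;
    }

    /* lazily create the cache the first time this interpreter needs it,
       keeping interpreter creation cheap; destruction must later walk
       the whole table and tear down every non-NULL entry                 */
    quantum_cache_t *cache_for_size(interp_t *interp, size_t size)
    {
        int idx = qcache_index(size);
        if (interp->qcache[idx] == NULL) {
            quantum_cache_t *qc = malloc(sizeof *qc);
            qc->obj_size = (size_t)8 << idx;
            interp->qcache[idx] = qc;
        }
        return interp->qcache[idx];
    }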
libumem doesn't have to worry about this, since its caches are CPU-specific, not thread-specific, but then libumem pays a price if the app never becomes multi-threaded.

I had other gripes, but this is a lot to absorb / comment on / flame.. :)

-Michael
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MLOCK(lck)   pthread_mutex_lock(&lck)
#define MUNLOCK(lck) pthread_mutex_unlock(&lck)

int size        = 10000000;   /* alloc/free iterations per thread */
int num_threads = 1;

typedef char* cp;

typedef struct {
    pthread_mutex_t lck;
    cp*             mag;          /* magazine stack */
    int             mag_top, id;
    int             pad1, pad2, pad3, pad4;   /* cache-alignment of 16B */
} interp_t;

interp_t interps[30];

/* trivial slab layer: one static arena, never freed, guarded by a lock */
pthread_mutex_t slab_lck;
char slab_mem[160024];
int  slab_top = 0;
int  slab_last_size;

void slab_init()
{
    pthread_mutex_init(&slab_lck, NULL);
}

char* slab_alloc(int size)
{
    char* ptr;
    MLOCK(slab_lck);
    ptr = slab_mem + slab_top;
    slab_top += size;
    slab_last_size = size;
    printf("slab-ret-val=%p, slab_base=%p, slab_top=%i, size=%i\n",
           ptr, slab_mem, slab_top, size);
    MUNLOCK(slab_lck);
    return ptr;
}

int interp_top = 0;
pthread_mutex_t interp_lock;

void perl_init()
{
    pthread_mutex_init(&interp_lock, NULL);
}

void init_interp(interp_t* interp)
{
    MLOCK(interp_lock);
    pthread_mutex_init(&interp->lck, NULL);   /* unused on the fast path */
    interp->mag     = (char**)slab_alloc(10); /* tiny magazine: one slot is
                                                 enough for this benchmark */
    interp->mag_top = 0;
    interp->id      = interp_top++;
    MUNLOCK(interp_lock);
}

/* fast path: per-interpreter magazine, no locking */
void* MyAlloc(interp_t* interp, int size)
{
    void* ptr;
    /* MLOCK(interp->lck); */
    if (interp->mag_top > 0) {
        ptr = interp->mag[--(interp->mag_top)];
    } else {
        ptr = slab_alloc(size);
    }
    /* MUNLOCK(interp->lck); */
    return ptr;
}

void MyFree(interp_t* interp, void* ptr)
{
    /* MLOCK(interp->lck); */
    interp->mag[(interp->mag_top)++] = ptr;
    /* MUNLOCK(interp->lck); */
}

/* locked variants, used by the shared/global magazine wrappers */
void* MySyncAlloc(interp_t* interp, int size)
{
    void* ptr;
    MLOCK(interp->lck);
    if (interp->mag_top > 0) {
        ptr = interp->mag[--(interp->mag_top)];
    } else {
        ptr = slab_alloc(size);
    }
    MUNLOCK(interp->lck);
    return ptr;
}

void MySyncFree(interp_t* interp, void* ptr)
{
    MLOCK(interp->lck);
    interp->mag[(interp->mag_top)++] = ptr;
    MUNLOCK(interp->lck);
}

/* malloc-style wrappers bound to the global interp (interps[0]) */
void* MyGlobalAlloc(int size) { return MySyncAlloc(interps, size); }
void  MyGlobalFree(void* ptr) { MySyncFree(interps, ptr); }

void* doloop(void* arg)
{
    interp_t* interp   = (interp_t*)arg;
    int       msize    = size;
    int*      ptr;
    int*      last_ptr = NULL;

    for (; msize > 0; msize--) {
        /* ptr = (int*)MyAlloc(interp, 100); */
        ptr = (int*)MyGlobalAlloc(100);
        /* printf("alloced on interp-id %i\n", interp->id); */
        if (!ptr)
            printf("null ptr val\n");
        if (last_ptr != ptr)
            printf("interp-id %i ptr changed from %p to %p.\n",
                   interp->id, last_ptr, ptr);
        *ptr = 1;
        /* MyFree(interp, ptr); */
        MyGlobalFree(ptr);
        last_ptr = ptr;
    }
    printf("done with thread %i\n", interp->id);
    return arg;
}

int main(int argc, char** argv)
{
    pthread_t      thread[30];
    pthread_attr_t attr;
    int   foo     = 9;
    void* ret_val = &foo;
    int   i;

    slab_init();
    perl_init();
    init_interp(interps);                     /* global interp */

    if (argc > 1) {
        num_threads = argc - 1;               /* one thread per argument */
        pthread_attr_init(&attr);
        for (i = 0; i < num_threads; i++) {
            init_interp(interps + i + 1);
            pthread_create(thread + i, &attr, doloop, interps + i + 1);
        }
        for (i = 0; i < num_threads; i++) {
            pthread_join(thread[i], &ret_val);
            printf("thr joined\n");
        }
    } else {
        doloop(interps);
    }
    printf("Done\n");
    return 0;
}
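For reference, the harness takes its thread count from the number of command-line arguments (each extra argument adds one thread; with no arguments the loop runs in the main thread). Something along these lines should build and run it on Linux with gcc (the magtest file name is just a placeholder):

    gcc -O2 -o magtest magtest.c -lpthread
    ./magtest               # single-threaded run in main()
    ./magtest x x x x       # four threads; the arguments are only counted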