Uri Guttman wrote:
> >>>>> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
>
> <off list>
>
>   DS> Ask has found us a spot for the paper Alan was speaking of.
>   DS> http://dev.perl.org/perl6/talks/
>
> very impressive paper.  my recent proposal for a multi-queue malloc
> system has some of the ideas of this but nowhere nearly as fully
> developed.  if we can drop in the libumem implementation directly, i
> think it will be a major win.  it doesn't look too difficult for us to
> write our own version (with some parrot centric features maybe).  the
> thread (vs. cpu) support looks very good too.  i like how they get
> constant or linear time results.
>
> uri
I was fascinated by the concepts and implications, so I wrote several test utilities on a 466MHz dual-processor Celeron running Red Hat Linux 7.1. The paper shows a benchmark based on a 10-CPU Solaris machine that is a large loop of allocs and frees (of an unspecified size, assumed to be 4 bytes). The paper shows single-threaded throughput of 0.5 mega-cycles/second growing to an upper bound of 3 Mc/second. I ran several combinations of a simple loop of:

    thr_main:
        for i .. size:
            ptr = alloc(4)
            *ptr = 1
            free(ptr)

    main:
        for i .. num_threads:
            pthread_create( thr_main )

The first thing to note is that using glibc's non-MT malloc I achieved 2.67 mega-cycles/second. Using the perl5 memory allocator (non-MT), I achieved 8.4 mega-cycles/second. Once pthreads were used (by uncommenting the thread-specific code and linking -lpthread), throughput fell to 0.978M/second for both perl5 and glibc. Perhaps I'm not successfully mapping alloc to perl - I tried various alternative names for the allocator (such as New, Perl_xxx, etc).

I obviously didn't have libumem to play with, but I did want to see what the lower bound was for the various operations, and I found some strange results. First I wrote an unlocked void* MyAlloc(int), void MyFree(void*) pair which simply used a stack for allocation. It wasn't a valid memory allocator, since it assumes that you free in reverse order, but it sufficed. It achieved 20 million cycles/second. Next I added mutexes, which slowed the single thread down to 1.3 mega-cycles/second.

The next thing I did was to split resources into separate arrays which are accessed via a hashing of the thread id. This required (int)pthread_self(). In the existing implementation pthread_t is just an int, but casting it doesn't give me warm fuzzies (how portable could that be?). Strangely enough, Linux thread ids are wildly jumping numbers (starting at over a thousand, with at least a thousand numeric values separating each). The hash was a simple masking of the rightmost several bits (3 in my test case, since I wasn't going to test more than 10 simultaneous threads).

Very quickly I found cache-line contention to be a problem (just as the paper suggested). Since I was using separate arrays of integers, the cache lines were hopelessly shared. My interim solution was to multiply the thread id by some value (8 or 16 work well), at the expense of unused array elements. The following are some benchmarks (the ones I can still read from my scratch sheets):

    Num threads | run time, mult = 16 (pad) | run time, mult = 2 | mutexes, mult = 16
    1           | 1.52s                     | 1.52s              | 8.05s
    2           | 8.25s                     | 8.6s               | 16.14s
    3           | 5.6s                      | 12.99s             | 16.18s
    4           | 3.86s                     | 17.23s             | 21s
    5           | 5.04s                     |                    | 23.19s
    6           | 6.7s                      |                    |
    7           | 6.64s                     |                    |

One comment: the left-most data set has some wild swings; the more common occurrence was something like:

    1 | 1.52s | 6.4 Mc/s
    2 | 8s    | 2.5 Mc/s
    3 | 5s    | 6 Mc/s
    4 | 5.5s  | 7.2 Mc/s
    5 | 6s    | 8.3 Mc/s
    6 | 6.5s  | ...

The above are off the top of my head, based on the types of numbers I've seen after hundreds of tweaks and unrecorded runs. Two things seemed true. First, the more threads that were used, the more efficient it became. Second, the second thread always, without exception, provided the slowest / least-efficient operation. I can't explain this. It seemed impervious to cache-line-width padding. Note in the padding-times table that two threads took roughly the same time for both the 2x and 16x thread-id multipliers.
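For concreteness, here is a minimal sketch of the hashed, padded per-thread indexing I'm describing (the names per_thread_slot, HASH_BITS and PAD are mine; PAD plays the role of the thread-id multiplier in the table, and the constants are simply the values I happened to test):

    #include <pthread.h>

    #define HASH_BITS 3                     /* enough for <= 10 test threads   */
    #define HASH_MASK ((1 << HASH_BITS) - 1)
    #define PAD       16                    /* spread slots across cache lines */

    /* one int per slot, padded so two threads rarely share a cache line */
    static int counters[(HASH_MASK + 1) * PAD];

    /* map the (wildly jumping) Linux thread id onto a small slot index;
       the cast of pthread_self() is exactly the portability worry above */
    int *per_thread_slot(void)
    {
        int id = (int)pthread_self();
        return &counters[(id & HASH_MASK) * PAD];
    }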
With only a 1x or 2x multiplier, the run time for 2 threads seems to fall in line with the other thread counts. The 1-thread operation is obviously faster because there is no cache-line contention. I'm curious to learn whether this is a general problem with Linux, x86, the dual-processor Celerons, or something more subtle.

At this point I was using a two-level memory manager. The first level was a thread-specific magazine allocator. In this primitive memory system there was no need for the "data" / "previous" magazines described in the Sun paper; instead a single magazine array and top-of-magazine-stack index was allocated for each thread. The second layer was a slab allocator. Again, in this simple test no attempt is made to free slab elements, so a static, fixed-size stack was used for slab requests. Since slab requests are rare (in this example, only once per thread), and since multiple threads will call upon it, a slab-specific mutex was used.

I output the hash values of the thread ids (which were used as scaled indexes into the thread-specific magazine data structures). As could be expected, various numbers of threads produced overlapping hash values. This is dangerous since it means multiple threads will use the same magazine structures, which means that each thread MUST lock when entering even the magazine allocator / deallocator. For reference, the Sun paper says that mutexes are needed for multiple reasons; the stated example was that the dynamic magazine size must periodically be changed by a single thread for all CPUs (threads mapped to a subset of magazine structures). The paper justifies this overhead by saying that locking a CPU-specific structure that has little or no contention (in that context, the only contention arises when magazine sizes are changed) has trivial overhead. Unfortunately, in user space a mutex seems to require at least a function call at best, and at worst a system call. Now, I'm sure that Solaris has some nicely optimized user-threading routines (for get-thread-id and lock / unlock), but for the Linux pthread library, luck is not on our side (at least not unless there are some asm includes hidden somewhere).

Finally I designed my real target system. The goals were as follows (a rough sketch of the resulting interface follows the list):

o thread info is stored inside the parrot interpreter data structures (avoiding thread-specific data and the dreaded pthread_self() with its unreliable values)

o the magazine layer is handled completely within the interpreter data structures

o prevent locking on the common execution path (keeping the common case as fast as possible)

o allow global locking on less-used operations (such as depot, slab, vmem)

o the magazine allocator / deallocator requires that the thread/interpreter-specific magazine data structure be passed in (breaking malloc compatibility). This is justifiable since all parrot code can be required to accept a high-level parrot interpreter with little extra overhead.

o for backward / external compatibility, provide a stand-alone magazine structure that serves as single-threaded access to a traditional malloc. In other words, "char* malloc(int size)" is a wrapper for "char* Parrot_Malloc(g_shared_magazine, size)". The only reason this would be necessary is for mis-behaved XS code and/or linked C libraries. It's possible to apply a CPU-specific architecture for the special case of malloc (as with libumem), but that reserves a rather large amount of resources and requires a higher setup time for such a [hopefully] rarely used utility.
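A minimal sketch of what that interface could look like, with hypothetical details everywhere the goals above don't spell them out (the struct layout, the names Parrot_Interp, Parrot_Free and compat_malloc, and the stand-in slab_alloc are all mine; only Parrot_Malloc and g_shared_magazine come from the list):

    #include <stdlib.h>
    #include <pthread.h>

    /* hypothetical per-interpreter magazine; the real layout would differ */
    typedef struct {
        void          **mag;       /* magazine stack of cached objects */
        int             mag_top;   /* next free slot in the magazine   */
        int             mag_size;  /* capacity of the magazine         */
        pthread_mutex_t lck;       /* used only by the shared fallback */
    } Parrot_Interp;

    /* stand-in for the globally locked slab layer; for this sketch it
       simply falls through to the system malloc                        */
    static void *slab_alloc(size_t size) { return malloc(size); }

    /* common path: no locking, the interpreter is passed explicitly */
    void *Parrot_Malloc(Parrot_Interp *interp, size_t size)
    {
        if (interp->mag_top > 0)
            return interp->mag[--interp->mag_top];
        return slab_alloc(size);           /* rare, globally locked path */
    }

    void Parrot_Free(Parrot_Interp *interp, void *ptr)
    {
        if (interp->mag_top < interp->mag_size)
            interp->mag[interp->mag_top++] = ptr;
        else
            free(ptr);                     /* magazine full: hand back   */
    }

    /* backward compatibility: one shared, mutex-guarded magazine for
       external code that only knows the traditional malloc signature    */
    static void         *g_shared_slots[64];
    static Parrot_Interp  g_shared_magazine =
        { g_shared_slots, 0, 64, PTHREAD_MUTEX_INITIALIZER };

    void *compat_malloc(size_t size)
    {
        void *ptr;
        pthread_mutex_lock(&g_shared_magazine.lck);
        ptr = Parrot_Malloc(&g_shared_magazine, size);
        pthread_mutex_unlock(&g_shared_magazine.lck);
        return ptr;
    }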
A rough cut of this code (still using a primitive magazine and slab allocator) is attached. The numbers are pretty good:

    threads | run time | throughput
    1       | .66s     | 15 Mc/s
    2       | 3.7s     | 5.3 Mc/s
    3       | 2.54s    | 11.8 Mc/s
    4       | 2.87s    | 13 Mc/s
    5       | 3.09s    | 16 Mc/s
    6       | 3.18s    | 18 Mc/s
    10      | 4.73s    | 21 Mc/s

Note that the values fluctuate by 1.5 seconds or more for thread counts greater than 2 (not shown), while remaining fairly constant for the first two. It's not surprising that the theoretical max was achieved for a large enough set of threads, since a pair of alloc/free calls is simply two function calls, two array accesses and two index modifications - essentially a handful of assembly-language operations. Guesstimating this at a dozen or so instructions, that presumes roughly 0.25 ops / clock / CPU on my machine.

The point of this exercise was to explore a skeleton implementation on Linux with our current needs (the non-standard memory-manager interface with a back door for external code). I believe the comparison of the above benchmarks suggests that my stated set of goals at least warrants further investigation, and possibly implementation within Parrot code. The Solaris implementation does not perform efficiently when dropped directly onto the Linux platform with the existing pthreads library. It is assumed that libumem will only perform well on the Solaris platform, where it may stand as a compile candidate. I don't know if libumem is available on the latest-generation Solaris platforms, but I'll check when I have time. I am assuming that even with Solaris optimizations (with Solaris threads) the overhead of locking will hurt performance. The main saving grace is that libumem may provide a better POSIX-compliant memory manager (for functions without access to the per-thread interpreter).

Another key point I wanted to make is that the Solaris vmem architecture is slated to provide arbitrary memory alignment with zero slack space. Unfortunately, vmem strictly holds to external boundary tags, which buys memory / alignment purity at the cost of wasted memory space. In fact, the paper suggests that for user space vmem is foregone altogether, and that the slab allocator makes direct calls to sbrk / mmap, which in turn make use of vmem, albeit through a system call. If mmap were fast enough this might be feasible, but sbrk would require some other form of user-space memory mapper, and Linux might not like having thousands of memory maps (one for each 2+K memory request). Not to mention this might not be very portable to other OSes.

There are dozens of variations on the user-space vmem architecture, but one feature that I really like is that vmem requests seem to be partial to page boundaries (the kernel vmem likes to operate exclusively on groups of pages). A page is a very nicely aligned memory reference. This, coupled with the use of object caches (fixed-size memory requests that optionally have constructors / destructors applied at appropriate and efficient times), could provide all the alignment that an application could want.

The only issue I'm currently worried about is that if the interpreter houses the per-thread structures, then there will be an awful lot of them - consider the concept of quantum caches: 30+ fixed-size, dynamically instantiated power-of-2 caches, each of which would have to maintain references within each interpreter. By checking for a null pointer and instantiating each object cache on demand, parrot-interpreter initialization shouldn't be that bad, but parrot-interpreter destruction could be murder.
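To make the on-demand instantiation concrete, here is a minimal sketch (the names quantum_cache_t, cache_for_size and NUM_QCACHES are hypothetical, not Parrot code, and the cache body is elided):

    #include <stdlib.h>

    #define NUM_QCACHES 30                /* power-of-2 sizes: 8 .. 8<<29 */

    typedef struct quantum_cache {
        size_t obj_size;                  /* fixed object size served     */
        /* ... magazines, slab references, etc. would live here ...       */
    } quantum_cache_t;

    typedef struct interp {
        /* per-interpreter cache table, all NULL until first use          */
        quantum_cache_t *qcache[NUM_QCACHES];
    } interp_t;

    /* round a request up to its power-of-2 cache index (8 bytes = idx 0) */
    static int qcache_index(size_t size)
    {
        int    idx = 0;
        size_t q   = 8;
        while (q < size) { q <<= 1; idx++; }
        return idx;
    }

    /* lazily create the cache the first time this interpreter needs it,
       keeping interpreter creation cheap; destruction must later walk
       the whole table and tear down every non-NULL entry                 */
    quantum_cache_t *cache_for_size(interp_t *interp, size_t size)
    {
        int idx = qcache_index(size);
        if (interp->qcache[idx] == NULL) {
            quantum_cache_t *qc = malloc(sizeof *qc);
            qc->obj_size = (size_t)8 << idx;
            interp->qcache[idx] = qc;
        }
        return interp->qcache[idx];
    }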
libumem doesn't have to worry about this, since its caches are CPU-specific, not thread-specific, but then libumem pays a price if the app never becomes multi-threaded.

I had other gripes, but this is a lot to absorb / comment on / flame.. :)

-Michael
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MLOCK(lck)   pthread_mutex_lock(&lck)
#define MUNLOCK(lck) pthread_mutex_unlock(&lck)

int size        = 10000000;   /* alloc/free iterations per thread */
int num_threads = 1;

typedef char* cp;

typedef struct {
    pthread_mutex_t lck;
    cp*             mag;          /* magazine stack */
    int             mag_top, id;
    int             pad1, pad2, pad3, pad4;   /* cache-alignment of 16B */
} interp_t;

interp_t interps[30];

/* trivial slab layer: one static arena, never freed, guarded by a lock */
pthread_mutex_t slab_lck;
char slab_mem[160024];
int  slab_top = 0;
int  slab_last_size;

void slab_init()
{
    pthread_mutex_init(&slab_lck, NULL);
}

char* slab_alloc(int size)
{
    char* ptr;
    MLOCK(slab_lck);
    ptr = slab_mem + slab_top;
    slab_top += size;
    slab_last_size = size;
    printf("slab-ret-val=%p, slab_base=%p, slab_top=%i, size=%i\n",
           ptr, slab_mem, slab_top, size);
    MUNLOCK(slab_lck);
    return ptr;
}

int interp_top = 0;
pthread_mutex_t interp_lock;

void perl_init()
{
    pthread_mutex_init(&interp_lock, NULL);
}

void init_interp(interp_t* interp)
{
    MLOCK(interp_lock);
    pthread_mutex_init(&interp->lck, NULL);   /* unused on the fast path */
    interp->mag     = (char**)slab_alloc(10); /* tiny magazine: one slot is
                                                 enough for this benchmark */
    interp->mag_top = 0;
    interp->id      = interp_top++;
    MUNLOCK(interp_lock);
}

/* fast path: per-interpreter magazine, no locking */
void* MyAlloc(interp_t* interp, int size)
{
    void* ptr;
    /* MLOCK(interp->lck); */
    if (interp->mag_top > 0) {
        ptr = interp->mag[--(interp->mag_top)];
    } else {
        ptr = slab_alloc(size);
    }
    /* MUNLOCK(interp->lck); */
    return ptr;
}

void MyFree(interp_t* interp, void* ptr)
{
    /* MLOCK(interp->lck); */
    interp->mag[(interp->mag_top)++] = ptr;
    /* MUNLOCK(interp->lck); */
}

/* locked variants, used by the shared/global magazine wrappers */
void* MySyncAlloc(interp_t* interp, int size)
{
    void* ptr;
    MLOCK(interp->lck);
    if (interp->mag_top > 0) {
        ptr = interp->mag[--(interp->mag_top)];
    } else {
        ptr = slab_alloc(size);
    }
    MUNLOCK(interp->lck);
    return ptr;
}

void MySyncFree(interp_t* interp, void* ptr)
{
    MLOCK(interp->lck);
    interp->mag[(interp->mag_top)++] = ptr;
    MUNLOCK(interp->lck);
}

/* malloc-style wrappers bound to the global interp (interps[0]) */
void* MyGlobalAlloc(int size) { return MySyncAlloc(interps, size); }
void  MyGlobalFree(void* ptr) { MySyncFree(interps, ptr); }

void* doloop(void* arg)
{
    interp_t* interp   = (interp_t*)arg;
    int       msize    = size;
    int*      ptr;
    int*      last_ptr = NULL;

    for (; msize > 0; msize--) {
        /* ptr = (int*)MyAlloc(interp, 100); */
        ptr = (int*)MyGlobalAlloc(100);
        /* printf("alloced on interp-id %i\n", interp->id); */
        if (!ptr)
            printf("null ptr val\n");
        if (last_ptr != ptr)
            printf("interp-id %i ptr changed from %p to %p.\n",
                   interp->id, last_ptr, ptr);
        *ptr = 1;
        /* MyFree(interp, ptr); */
        MyGlobalFree(ptr);
        last_ptr = ptr;
    }
    printf("done with thread %i\n", interp->id);
    return arg;
}

int main(int argc, char** argv)
{
    pthread_t      thread[30];
    pthread_attr_t attr;
    int   foo     = 9;
    void* ret_val = &foo;
    int   i;

    slab_init();
    perl_init();
    init_interp(interps);                     /* global interp */

    if (argc > 1) {
        num_threads = argc - 1;               /* one thread per argument */
        pthread_attr_init(&attr);
        for (i = 0; i < num_threads; i++) {
            init_interp(interps + i + 1);
            pthread_create(thread + i, &attr, doloop, interps + i + 1);
        }
        for (i = 0; i < num_threads; i++) {
            pthread_join(thread[i], &ret_val);
            printf("thr joined\n");
        }
    } else {
        doloop(interps);
    }
    printf("Done\n");
    return 0;
}
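For reference, the harness takes its thread count from the number of command-line arguments (each extra argument adds one thread; with no arguments the loop runs in the main thread). Something along these lines should build and run it on Linux with gcc (the magtest file name is just a placeholder):

    gcc -O2 -o magtest magtest.c -lpthread
    ./magtest               # single-threaded run in main()
    ./magtest x x x x       # four threads; the arguments are only counted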