I don't think we'll be able to take this patch due to changes we've been making. However, this is an *excellent* use case for our new pluggable engine mechanism, which is opening up for development soon.
On Aug 5, 10:01 pm, "[email protected]" <[email protected]> wrote:
> Hi,
> here at nasza-klasa.pl, we were always a bit worried about the way that
> memcached handles expired elements, which causes evictions of unexpired
> entries even though there is a lot of garbage that could be freed.
> After introducing the patch below we managed to eliminate evictions
> entirely without sacrificing speed, memory or cache hit/miss ratio.
>
> Another benefit of this patch is that if you know that there is no
> garbage in memory, then you can put more trust in stats such as the
> number of cached items, and then decide how much RAM you really need,
> or how long the TTLs should be.
>
> It was already known that a lot of memory can be wasted by expired
> entries. In our case it was at least 50%, but it could actually be an
> arbitrarily large amount, as memcached always consumes all the memory
> that you assign to it.
>
> It seems that many people believed for some reason that freeing expired
> entries cannot be performed in O(1), or even that it requires Omega(N)
> operations. Using a heap, or other priority queue data structures, can
> obviously give you O(log N), but the patch below aims at O(1).
> This is achieved by having one bucket for each second, and by putting
> each entry into the proper bucket by looking at its expiration time.
> Then at each second we just free elements from the next bucket.
> This is obviously O(1) amortized time, but it can happen that many
> elements expire at the very same moment, so the patch offers a fixed
> limit on the number of operations you are willing to perform per
> request, to make the garbage collection run more smoothly.
>
> You can tweak this solution in many ways.
>
> For example, you do not really need one bucket for each second. You
> could group elements from the same minute, or the same hour, which
> makes the number of buckets smaller at the cost of garbage collection
> occurring a bit later.
>
> The patch introduces 16 additional bytes per item for the doubly-linked
> list in each bucket. Another change that you may consider is to use
> singly-linked lists for the buckets, which gives you 8 bytes per item
> back.
> This leads to linear access time within a bucket, so you should make
> sure that each bucket stays really small.
> This can be achieved by using a two-dimensional array of buckets: one
> dimension being the expiration time, and the other being some
> pseudorandom hash function of the key.
> We find such an improvement both risky and unnecessary, as the number
> of elements on a single server multiplied by 8 bytes is not really a
> big amount.
>
> Here at nasza-klasa.pl we run memcached in single-threaded mode, so we
> have not tested the patch with multithreaded set-ups. We have been
> testing the patch in production for several days now, though.
> There were no problems, evictions dropped to zero, and the number of
> cached items is about 50% of full capacity. The cache hit/miss ratio is
> the same as before the change, and the speed seems similar as well, but
> we were never too concerned about speed, so you may want to test that
> yourself.
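As an aside on the singly-linked, two-dimensional variant described above: it is not what the patch below implements, but a rough standalone sketch of the idea might look like the following. This is completely outside memcached; the struct layout, array sizes, and helper names are invented for illustration and untested. The second index is a hash of the key, so each chain stays short even when many items expire in the same second.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sizes only; a real deployment would tune these. */
#define EXP_SLOTS (1 << 10)   /* one slot per second of TTL range */
#define KEY_SLOTS (1 << 4)    /* second dimension: hash of the key */

typedef struct entry {
    struct entry *e_next;     /* singly linked: 8 bytes per item */
    uint32_t      exptime;    /* absolute expiration time, in seconds */
    char          key[32];
} entry;

/* First index is the expiration second; the second index spreads items
 * that expire in the same second across several short chains. */
static entry *buckets[EXP_SLOTS][KEY_SLOTS];

static unsigned key_hash(const char *k) {
    unsigned h = 5381;
    while (*k) h = h * 33 + (unsigned char)*k++;
    return h & (KEY_SLOTS - 1);
}

static void bucket_insert(entry *it) {
    entry **b = &buckets[it->exptime & (EXP_SLOTS - 1)][key_hash(it->key)];
    it->e_next = *b;
    *b = it;
}

/* Free everything that expired exactly at second t.  Items from other
 * "laps" of the ring (exptime differs) are skipped, just like unexpired
 * items are skipped in the patch. */
static void bucket_expire(uint32_t t) {
    unsigned j;
    for (j = 0; j < KEY_SLOTS; j++) {
        entry **pp = &buckets[t & (EXP_SLOTS - 1)][j];
        while (*pp) {
            if ((*pp)->exptime == t) {
                entry *dead = *pp;
                *pp = dead->e_next;
                free(dead);
            } else {
                pp = &(*pp)->e_next;
            }
        }
    }
}

int main(void) {
    entry *it = calloc(1, sizeof *it);
    if (!it) return 1;
    strcpy(it->key, "user:42");
    it->exptime = 100;
    bucket_insert(it);
    bucket_expire(100);       /* frees the item inserted above */
    return 0;
}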
> Here is the patch:
>
> Index: memcached-autoexpire/memcached-1.2.8/items.c
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/items.c (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/items.c (revision 17295)
> @@ -31,28 +31,58 @@
>      rel_time_t evicted_time;
>      unsigned int outofmemory;
>      unsigned int tailrepairs;
>  } itemstats_t;
>
>  static item *heads[LARGEST_ID];
>  static item *tails[LARGEST_ID];
>  static itemstats_t itemstats[LARGEST_ID];
>  static unsigned int sizes[LARGEST_ID];
>
> +/* The number of buckets should be larger than the largest TTL in use,
> + * to prevent non-expired items from being processed in the cleanup loop.
> + * If this is a problem, you can change the EXP_HASH macro.
> + * The number of buckets should be a power of two, to allow bitwise &
> + * instead of % when determining the bucket.
> + */
> +#define EXPIRE_BUCKETS_CNT (1<<20)
> +
> +/* If you have TTLs larger than EXPIRE_BUCKETS_CNT,
> + * you may consider replacing (x) with something like ((x)>>6),
> + * to group elements in buckets that span 64 seconds,
> + * and setting MAX_DELETIONS_PER_CLEANUP to a smaller value.
> + */
> +#define EXP_HASH(x) ((x)%EXPIRE_BUCKETS_CNT)
> +/* This value limits the number of steps performed on a single request
> + * and should be larger than 1.
> + * It is used to amortize the (constant) cost of cleanup over time,
> + * so that we do not have to delete too many elements during a single
> + * request just because many objects had the same exptime.
> + */
> +#define MAX_DELETIONS_PER_CLEANUP 10
> +
> +/* These buckets are used to keep track of all elements that expire at
> + * the same moment. Each item should be stored in bucket
> + * item->exptime%EXPIRE_BUCKETS_CNT unless it has an infinite TTL.
> + */
> +static item *expire_bucket[EXPIRE_BUCKETS_CNT];
> +#define EXPIRE_BUCKET(x) expire_bucket[EXP_HASH(x)]
> +
>  void item_init(void) {
>      int i;
>      memset(itemstats, 0, sizeof(itemstats_t) * LARGEST_ID);
>      for(i = 0; i < LARGEST_ID; i++) {
>          heads[i] = NULL;
>          tails[i] = NULL;
>          sizes[i] = 0;
>      }
> +    //cleanup of global variables seems unnecessary,
> +    //but this line is here for completeness
> +    memset(expire_bucket, 0, sizeof(item*) * EXPIRE_BUCKETS_CNT);
>  }
>
>  /* Get the next CAS id for a new item. */
>  uint64_t get_cas_id() {
>      static uint64_t cas_id = 0;
>      return ++cas_id;
>  }
>
>  /* Enable this for reference-count debugging. */
>  #if 0
> @@ -78,30 +108,59 @@
>   *
>   * Returns the total size of the header.
>   */
>  static size_t item_make_header(const uint8_t nkey, const int flags, const int nbytes,
>                       char *suffix, uint8_t *nsuffix) {
>      /* suffix is defined at 40 chars elsewhere.. */
>      *nsuffix = (uint8_t) snprintf(suffix, 40, " %d %d\r\n", flags, nbytes - 2);
>      return sizeof(item) + nkey + *nsuffix + nbytes;
>  }
>
> +/**
> + * Tries to free all expired elements from the bucket pointed to by
> + * last_cleanup_time.
> + * It performs at most MAX_DELETIONS_PER_CLEANUP deletions.
> + * It skips elements that are locked (refcount!=0).
> + * If it reaches the end of the bucket it returns true.
> + * Otherwise it signals that there is more to do by returning false.
> + * Elements with refcount!=0 are not signalled and do not count against
> + * the limit of deletions.
> + */
> +bool clean_up() {
> +    item *it, *e_next;
> +    uint32_t deletions_left = MAX_DELETIONS_PER_CLEANUP;
> +    for (it = EXPIRE_BUCKET(last_cleanup_time); it; ) {
> +        e_next = it->e_next;
> +        if (0 == it->refcount && last_cleanup_time == it->exptime) {
> +            if (deletions_left--) {
> +                do_item_unlink(it);
> +            } else {
> +                return false;
> +            }
> +        }
> +        it = e_next;
> +    }
> +    return true;
> +}
> +
>  /*@null@*/
>  item *do_item_alloc(char *key, const size_t nkey, const int flags,
>                      const rel_time_t exptime, const int nbytes) {
>      uint8_t nsuffix;
>      item *it;
>      char suffix[40];
>      size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix);
>
>      unsigned int id = slabs_clsid(ntotal);
>      if (id == 0)
>          return 0;
> +
> +    while (last_cleanup_time < current_time && clean_up()) {
> +        ++last_cleanup_time;
> +    }
>
>      it = slabs_alloc(ntotal, id);
>      if (it == 0) {
>          int tries = 50;
>          item *search;
>
>          /* If requested to not push old items out of cache when memory runs out,
>           * we're out of luck at this point...
>           */
>
> @@ -202,56 +261,77 @@
>   */
>  bool item_size_ok(const size_t nkey, const int flags, const int nbytes) {
>      char prefix[40];
>      uint8_t nsuffix;
>
>      return slabs_clsid(item_make_header(nkey + 1, flags, nbytes,
>                                          prefix, &nsuffix)) != 0;
>  }
>
>  static void item_link_q(item *it) { /* item is the new head */
> -    item **head, **tail;
> +    item **head, **tail, **e_bucket;
> +
>      /* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
>      assert((it->it_flags & ITEM_SLABBED) == 0);
>
>      head = &heads[it->slabs_clsid];
>      tail = &tails[it->slabs_clsid];
>      assert(it != *head);
>      assert((*head && *tail) || (*head == 0 && *tail == 0));
>      it->prev = 0;
>      it->next = *head;
>      if (it->next) it->next->prev = it;
>      *head = it;
>      if (*tail == 0) *tail = it;
> +    if (it->exptime) {
> +        e_bucket = &EXPIRE_BUCKET(it->exptime);
> +        it->e_prev = NULL;
> +        it->e_next = *e_bucket;
> +        if (it->e_next) it->e_next->e_prev = it;
> +        *e_bucket = it;
> +    }
>      sizes[it->slabs_clsid]++;
>      return;
>  }
>
>  static void item_unlink_q(item *it) {
> -    item **head, **tail;
> +    item **head, **tail, **e_bucket;
>      /* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
>      head = &heads[it->slabs_clsid];
>      tail = &tails[it->slabs_clsid];
>
>      if (*head == it) {
>          assert(it->prev == 0);
>          *head = it->next;
>      }
>      if (*tail == it) {
>          assert(it->next == 0);
>          *tail = it->prev;
>      }
>      assert(it->next != it);
>      assert(it->prev != it);
>
>      if (it->next) it->next->prev = it->prev;
>      if (it->prev) it->prev->next = it->next;
> +
> +    if (it->exptime) {
> +        e_bucket = &EXPIRE_BUCKET(it->exptime);
> +        if (*e_bucket == it) {
> +            assert(it->e_prev == NULL);
> +            *e_bucket = it->e_next;
> +        }
> +        assert(it->e_next != it);
> +        assert(it->e_prev != it);
> +        if (it->e_next) it->e_next->e_prev = it->e_prev;
> +        if (it->e_prev) it->e_prev->e_next = it->e_next;
> +    }
> +
>      sizes[it->slabs_clsid]--;
>      return;
>  }
>
>  int do_item_link(item *it) {
>      MEMCACHED_ITEM_LINK(ITEM_key(it), it->nbytes);
>      assert((it->it_flags & (ITEM_LINKED|ITEM_SLABBED)) == 0);
>      assert(it->nbytes < (1024 * 1024)); /* 1MB max size */
>      it->it_flags |= ITEM_LINKED;
>      it->time = current_time;
>
> Index: memcached-autoexpire/memcached-1.2.8/items.h
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/items.h (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/items.h (revision 17289)
> @@ -1,13 +1,15 @@
>  /* See items.c */
>  void item_init(void);
>  uint64_t get_cas_id(void);
> +
> +rel_time_t last_cleanup_time; /** relative timestamp pointing to the bucket that we should now clean up **/
>
>  /*@null@*/
>  item *do_item_alloc(char *key, const size_t nkey, const int flags,
>                      const rel_time_t exptime, const int nbytes);
>  void item_free(item *it);
>  bool item_size_ok(const size_t nkey, const int flags, const int nbytes);
>
>  int do_item_link(item *it);     /** may fail if transgresses limits */
>  void do_item_unlink(item *it);
>  void do_item_remove(item *it);
>  void do_item_update(item *it);  /** update LRU time to current and reposition */
>
> Index: memcached-autoexpire/memcached-1.2.8/memcached.h
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/memcached.h (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/memcached.h (revision 17289)
> @@ -109,20 +109,22 @@
>  #define ITEM_LINKED 1
>  #define ITEM_DELETED 2
>
>  /* temp */
>  #define ITEM_SLABBED 4
>
>  typedef struct _stritem {
>      struct _stritem *next;
>      struct _stritem *prev;
>      struct _stritem *h_next;    /* hash chain next */
> +    struct _stritem *e_next;    /* next item in the same expire_bucket */
> +    struct _stritem *e_prev;    /* prev item in the same expire_bucket */
>      rel_time_t      time;       /* least recent access */
>      rel_time_t      exptime;    /* expire time */
>      int             nbytes;     /* size of data */
>      unsigned short  refcount;
>      uint8_t         nsuffix;    /* length of flags-and-length string */
>      uint8_t         it_flags;   /* ITEM_* above */
>      uint8_t         slabs_clsid;/* which slab class we're in */
>      uint8_t         nkey;       /* key length, w/terminating null and padding */
>      uint64_t        cas_id;     /* the CAS identifier */
>      void * end[];
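A small illustration of the tuning knobs mentioned in the patch comments above; this is just a reading of the patch, not part of it, and the macro names below (other than EXPIRE_BUCKETS_CNT itself) are made up. Since EXPIRE_BUCKETS_CNT is a power of two, the bucket hash can use a mask instead of %, and the ((x)>>6) variant simply folds 64 consecutive seconds into one bucket. Because clean_up() still compares it->exptime against last_cleanup_time exactly, coarser buckets never free an unexpired item early; the only cost is that the same chain is walked once for each second of its window.

/* Illustrative only -- neither macro is part of the patch above. */

/* One bucket per second, using a mask instead of modulo; valid because
 * EXPIRE_BUCKETS_CNT is a power of two. */
#define EXP_HASH_FINE(x)   ((x) & (EXPIRE_BUCKETS_CNT - 1))

/* One bucket per 64 seconds, for TTL ranges wider than EXPIRE_BUCKETS_CNT
 * seconds.  Items that have not yet expired are simply skipped by the
 * exact exptime check in clean_up(), never freed early. */
#define EXP_HASH_COARSE(x) (((x) >> 6) & (EXPIRE_BUCKETS_CNT - 1))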
