I don't think we'll be able to take this patch due to changes we've
been making, however this is an *excellent* use case for our new
pluggable engine mechanism that's opening up for development soon.

On Aug 5, 10:01 pm, "[email protected]"
<[email protected]> wrote:
> Hi,
> here at nasza-klasa.pl, we were always a bit worried about the way
> that memcached handles expired elements, which causes evictions of
> unexpired entries even though there is a lot of garbage that could be
> freed.
> After introducing the patch below we managed to eliminate evictions
> entirely without sacrificing speed, memory or cache hit/miss ratio.
>
> Another benefit of this patch is that if you know that there is no
> garbage in memory, then you can put more trust in stats such as number
> of cached items, and then decide how much RAM do you really need, or
> how long should the TTLs be.
>
> It was already known, that a lot of memory can be wasted by expired
> entries. In our case it was at least 50%, but actually it could be
> arbitrary large number, as memcached always consumes all memory that
> you assign to it.
>
> It seems that many people believed for some reason that freeing
> expired entries can not be performed in O(1), or even that it requires
> Omega(N) operations. Using a heap, or other priority queue data
> structures can obviously give you O(log N), but the patch below aims
> at O(1).
> This is achieved by having one bucket for each second, and by putting
> each entry in a proper bucket by looking at the expiration time. Then
> at each second we just free elements from the next bucket.
> This is obviously O(1) amortized time, but it can happen that you have
> a lot of elements that expire at the very same moment, so the patch
> offers a fixed limit for number of operations that you are willing to
> perform per one request, to make the garbage collection run smoother.
>
> You can tweak this solutions in many ways.
>
> For example you do not really need one bucket for each second. You
> could group elements from a same minute, or same hour, which will make
> the number of buckets smaller at the cost of garbage collection
> occurring a bit later.
>
> The patch introduces 16 additional bytes per item for doubly-linked
> list in each bucket. Another change that you may consider is to use
> single-linked lists for buckets, which can give you 8 bytes per item
> back.
> This will lead to a linear access time to buckets, so you should make
> sure that each bucket is really small.
> This can be achieved by using two dimensional array of buckets : one
> dimension being the expiration time, and the second being some
> pseudorandom hash function of a key.
> We find such an improvement both risky and unnecessary, as a number of
> elements on a single server multiplied by 8 bytes is not really a big
> change.
>
> Here at nasza-klasa.pl we run memcached in a single threaded mode, so
> we have not tested the patch for the multithread set-ups. We've beeb
> testing the patch on production for several days right know, though.
> There were no problems, evictions dropped to zero, and number of
> cached items is about 50% of full capacity. The cache hit/miss ratio
> is same as before the change, and the speed seems similar as well, but
> we were never too much concerned about the speed so you can test it by
> yourself.
>
> Here is the patch:
>
> Index: memcached-autoexpire/memcached-1.2.8/items.c
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/items.c (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/items.c (revision 17295)
> @@ -31,28 +31,58 @@
>      rel_time_t evicted_time;
>      unsigned int outofmemory;
>      unsigned int tailrepairs;
>  } itemstats_t;
>
>  static item *heads[LARGEST_ID];
>  static item *tails[LARGEST_ID];
>  static itemstats_t itemstats[LARGEST_ID];
>  static unsigned int sizes[LARGEST_ID];
>
> +/* Number of buckets should be larger than the largest TTL in use,
> + * to prevent non-expired items from being processed in the cleanup
> loop.
> + * If this is a problem, you can change EXP_HASH macro.
> + * Number of buckets should be a power of two, to allow bitwise &
> instead of %,
> + * when determining the bucket.
> + */
> +#define EXPIRE_BUCKETS_CNT (1<<20)
> +
> +/* If you have TTLs larger than EXPIRE_BUCKETS_CNT,
> + * you may consider replacing (x) with something like ((x)>>6),
> + * to group elements in buckets that span 64 seconds,
> + * and setting MAX_DELETIONS_PER_CLEANUP to a smaller value.
> + */
> +#define EXP_HASH(x) ((x)%EXPIRE_BUCKETS_CNT)
> +/* This value limits number of steps performed on a single request
> and should be larger than 1.
> + * This is used to amortize the cost of cleanup (which is constant)
> in time,
> + * so that we do not have to delete too many elements during a single
> request, just because
> + * many objects had same exptime.
> + */
> +#define MAX_DELETIONS_PER_CLEANUP 10
> +
> +/* These buckets are used to keep track of all elements that expire
> at the same moment.
> + * Each item should be stored in item->exptime%EXPIRE_BUCKETS_CNT
> unless it has inifinite TTL.
> + */
> +static item *expire_bucket[EXPIRE_BUCKETS_CNT];
> +#define EXPIRE_BUCKET(x) expire_bucket[EXP_HASH(x)]
> +
>  void item_init(void) {
>      int i;
>      memset(itemstats, 0, sizeof(itemstats_t) * LARGEST_ID);
>      for(i = 0; i < LARGEST_ID; i++) {
>          heads[i] = NULL;
>          tails[i] = NULL;
>          sizes[i] = 0;
>      }
> +    //cleanup of global variables seems unnecessary,
> +    //but this line is here for completeness
> +    memset(expire_bucket, 0, sizeof(item*) * EXPIRE_BUCKETS_CNT);
>  }
>
>  /* Get the next CAS id for a new item. */
>  uint64_t get_cas_id() {
>      static uint64_t cas_id = 0;
>      return ++cas_id;
>  }
>
>  /* Enable this for reference-count debugging. */
>  #if 0
> @@ -78,30 +108,59 @@
>   *
>   * Returns the total size of the header.
>   */
>  static size_t item_make_header(const uint8_t nkey, const int flags,
> const int nbytes,
>                       char *suffix, uint8_t *nsuffix) {
>      /* suffix is defined at 40 chars elsewhere.. */
>      *nsuffix = (uint8_t) snprintf(suffix, 40, " %d %d\r\n", flags,
> nbytes - 2);
>      return sizeof(item) + nkey + *nsuffix + nbytes;
>  }
>
> +/**
> + * Tries to free all expired elements from the bucket pointed by
> last_cleanup_time.
> + * It performs at most MAX_DELETIONS_PER_CLEANUP deletions.
> + * It skips elements that are locked (refcount!=0).
> + * If it reaches the end of the bucket it returns true.
> + * Otherwise it signals that there is something more to do by
> returning false.
> + * Elements with refcount!=0 are not signalized, and do not affect
> the limit of deletions.
> + */
> +bool clean_up() {
> +    item * it, *e_next;
> +    uint32_t deletions_left = MAX_DELETIONS_PER_CLEANUP;
> +    for (it = EXPIRE_BUCKET(last_cleanup_time); it; ) {
> +      e_next = it->e_next;
> +      if (0 == it->refcount && last_cleanup_time == it->exptime) {
> +          if(deletions_left--){
> +              do_item_unlink(it);
> +          }else{
> +              return false;
> +          }
> +      }
> +      it = e_next;
> +    }
> +    return true;
> +}
> +
>  /*...@null@*/
>  item *do_item_alloc(char *key, const size_t nkey, const int flags,
> const rel_time_t exptime, const int nbytes) {
>      uint8_t nsuffix;
>      item *it;
>      char suffix[40];
>      size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix,
> &nsuffix);
>
>      unsigned int id = slabs_clsid(ntotal);
>      if (id == 0)
>          return 0;
> +
> +    while (last_cleanup_time < current_time && clean_up()) {
> +        ++last_cleanup_time;
> +    }
>
>      it = slabs_alloc(ntotal, id);
>      if (it == 0) {
>          int tries = 50;
>          item *search;
>
>          /* If requested to not push old items out of cache when
> memory runs out,
>           * we're out of luck at this point...
>           */
>
> @@ -202,56 +261,77 @@
>   */
>  bool item_size_ok(const size_t nkey, const int flags, const int
> nbytes) {
>      char prefix[40];
>      uint8_t nsuffix;
>
>      return slabs_clsid(item_make_header(nkey + 1, flags, nbytes,
>                                          prefix, &nsuffix)) != 0;
>  }
>
>  static void item_link_q(item *it) { /* item is the new head */
> -    item **head, **tail;
> +    item **head, **tail, **e_bucket;
> +
>      /* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
>      assert((it->it_flags & ITEM_SLABBED) == 0);
>
>      head = &heads[it->slabs_clsid];
>      tail = &tails[it->slabs_clsid];
>      assert(it != *head);
>      assert((*head && *tail) || (*head == 0 && *tail == 0));
>      it->prev = 0;
>      it->next = *head;
>      if (it->next) it->next->prev = it;
>      *head = it;
>      if (*tail == 0) *tail = it;
> +    if (it->exptime) {
> +        e_bucket = &EXPIRE_BUCKET(it->exptime);
> +        it->e_prev = NULL;
> +        it->e_next = *e_bucket;
> +        if (it->e_next) it->e_next->e_prev = it;
> +        *e_bucket = it;
> +    }
>      sizes[it->slabs_clsid]++;
>      return;
>  }
>
>  static void item_unlink_q(item *it) {
> -    item **head, **tail;
> +    item **head, **tail, **e_bucket;
>      /* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
>      head = &heads[it->slabs_clsid];
>      tail = &tails[it->slabs_clsid];
>
>      if (*head == it) {
>          assert(it->prev == 0);
>          *head = it->next;
>      }
>      if (*tail == it) {
>          assert(it->next == 0);
>          *tail = it->prev;
>      }
>      assert(it->next != it);
>      assert(it->prev != it);
>
>      if (it->next) it->next->prev = it->prev;
>      if (it->prev) it->prev->next = it->next;
> +
> +    if (it->exptime) {
> +        e_bucket = &EXPIRE_BUCKET(it->exptime);
> +        if (*e_bucket == it) {
> +            assert(it->e_prev == NULL);
> +            *e_bucket = it->e_next;
> +        }
> +        assert(it->e_next != it);
> +        assert(it->e_prev != it);
> +        if (it->e_next) it->e_next->e_prev = it->e_prev;
> +        if (it->e_prev) it->e_prev->e_next = it->e_next;
> +    }
> +
>      sizes[it->slabs_clsid]--;
>      return;
>  }
>
>  int do_item_link(item *it) {
>      MEMCACHED_ITEM_LINK(ITEM_key(it), it->nbytes);
>      assert((it->it_flags & (ITEM_LINKED|ITEM_SLABBED)) == 0);
>      assert(it->nbytes < (1024 * 1024));  /* 1MB max size */
>      it->it_flags |= ITEM_LINKED;
>      it->time = current_time;
> Index: memcached-autoexpire/memcached-1.2.8/items.h
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/items.h (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/items.h (revision 17289)
> @@ -1,13 +1,15 @@
>  /* See items.c */
>  void item_init(void);
>  uint64_t get_cas_id(void);
> +
> +rel_time_t last_cleanup_time; /** relative timestamp pointing a
> bucket that we should now clean up  **/
>
>  /*...@null@*/
>  item *do_item_alloc(char *key, const size_t nkey, const int flags,
> const rel_time_t exptime, const int nbytes);
>  void item_free(item *it);
>  bool item_size_ok(const size_t nkey, const int flags, const int
> nbytes);
>
>  int  do_item_link(item *it);     /** may fail if transgresses limits
> */
>  void do_item_unlink(item *it);
>  void do_item_remove(item *it);
>  void do_item_update(item *it);   /** update LRU time to current and
> reposition */
> Index: memcached-autoexpire/memcached-1.2.8/memcached.h
> ===================================================================
> --- memcached-autoexpire/memcached-1.2.8/memcached.h (revision 16750)
> +++ memcached-autoexpire/memcached-1.2.8/memcached.h (revision 17289)
> @@ -109,20 +109,22 @@
>  #define ITEM_LINKED 1
>  #define ITEM_DELETED 2
>
>  /* temp */
>  #define ITEM_SLABBED 4
>
>  typedef struct _stritem {
>      struct _stritem *next;
>      struct _stritem *prev;
>      struct _stritem *h_next;    /* hash chain next */
> +    struct _stritem *e_next;    /* next item in the same
> expire_bucket */
> +    struct _stritem *e_prev;    /* prev item in the same
> expire_bucket */
>      rel_time_t      time;       /* least recent access */
>      rel_time_t      exptime;    /* expire time */
>      int             nbytes;     /* size of data */
>      unsigned short  refcount;
>      uint8_t         nsuffix;    /* length of flags-and-length string
> */
>      uint8_t         it_flags;   /* ITEM_* above */
>      uint8_t         slabs_clsid;/* which slab class we're in */
>      uint8_t         nkey;       /* key length, w/terminating null and
> padding */
>      uint64_t        cas_id;     /* the CAS identifier */
>      void * end[];

Reply via email to