Status: New
Owner: ----
Labels: Type-Defect Priority-Medium
New issue 272 by [email protected]: Memcached "hung" in do_item_alloc
http://code.google.com/p/memcached/issues/detail?id=272
What is the problem?
One night at Etsy, we started getting a flood of memcached errors. We were
seeing out-of-memory errors when trying to set/add any value into slab #4.
The slab held data of sizes up to 10 bytes.
I did some low-level checking of what happens in do_item_alloc(), since we
hadn't built memcached with debug symbols :( Details here:
https://gist.github.com/2634998
The walkthrough starts at search = tails[id]; on line 106. The code is here:
https://github.com/memcached/memcached/blob/master/items.c
The search item has a starting refcount of 2, so calling refcount_incr on
it makes it 3 and execution skips the whole block between lines 107 and 152
(the block that handles expired items and forcible eviction). In the else
block at line 154, slabs_alloc returns NULL because the slab is full. At
line 156, the refcount is decremented back to 2.
In the if block starting at line 167, the refcount is still 2, so the item
is never forcibly expired, and it stays in this state forever. This check
was recently changed from refcount != 0 to refcount != 2
(https://github.com/memcached/memcached/commit/f4983b2068d13e5dc71fc075c35a085e904999cf#items.c)
When this happens the "leaked" tails item forever blocks any new item from
being inserted into the list and we can't do any more operations on the
cluster in slab #4. The only way we could fix it was to restart memcached.
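To make the sequence above easier to follow, here is a minimal C sketch of
the control flow as described in this report. It is a toy model, not the
memcached source: model_item, model_result, model_item_alloc and the
slab_full / tail_is_old flags are made-up names for illustration, and both
the "eviction path always yields memory" shortcut and the reset to 1 in the
repair branch are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an item; only the refcount matters here. */
typedef struct {
    unsigned short refcount;
} model_item;

typedef enum {
    MODEL_GOT_MEMORY,    /* tail reclaimed or slab had free space */
    MODEL_OOM,           /* allocation failed, tail left untouched */
    MODEL_OOM_REPAIRED   /* allocation failed, but the tail was force-released */
} model_result;

/* Models one allocation attempt against the tail of a slab class's LRU. */
static model_result model_item_alloc(model_item *tail, bool slab_full,
                                     bool tail_is_old)
{
    assert(tail != NULL);

    /* Take a reference on the tail. A result of 2 means nobody else holds it. */
    if (++tail->refcount == 2) {
        /* Assumption: this path always yields memory (expiry or eviction).
         * The model simply drops the extra reference again. */
        tail->refcount--;
        return MODEL_GOT_MEMORY;
    }

    /* Tail is held elsewhere (or leaked): skip eviction, ask the slab. */
    bool got_memory = !slab_full;
    tail->refcount--;   /* give back the reference taken above */
    if (got_memory)
        return MODEL_GOT_MEMORY;

    /* Last-ditch repair as described: skipped when refcount is exactly 2. */
    if (tail->refcount != 2 && tail_is_old) {
        tail->refcount = 1;   /* assumption: repair resets to "linked only" */
        return MODEL_OOM_REPAIRED;
    }
    return MODEL_OOM;
}
```

In this model, a tail whose refcount is stuck at 2 on a full slab returns
MODEL_OOM on every call and keeps its refcount at 2, which matches what we
saw. A tail stuck at 3 would reach the repair branch instead, at least under
the assumptions above.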
What steps will reproduce the problem?
We don't exactly know how all 7 of our memcached boxes got into this state
one night. We still get occasional blips of these errors.
What is the expected output? What do you see instead?
Set/add should work.
What version of the product are you using? On what operating system?
memcached 1.4.13 on CentOS 5
Please provide any additional information below.
In our testing environment, I haven't been able to reproduce the issue, but
the volume I could generate there doesn't come close to production traffic.