Hi there, I've recently been investigating an intermittent, transient failure-to-set issue in a long-running memcached instance, and I believe I could use some insight from you all.
Let me list my configuration first. I have |stats| and |stats slabs| dumps attached to this post; if they fail to go through, let me know and I'll re-post them on a pastebin service.

*Configuration:*

Command line args: -m 2900 -f 1.16 -c 10240 -k -o modern

Running 1.4.36 (compiled by myself) on Ubuntu 14.04.4 x64. The -k flag has been verified to be effective (I have the limits configured correctly). The growth factor of 1.16 is just an empirical value tuned for my item sizes.

*Symptom of the issue:*

After memcached has been running for around 10 days, there have been occasions where a set request for a large item (sized around 760KiB to 930KiB) fails, with memcached returning error 37 (item too big). However, when this happens, if I wait around one minute and then send the same set request again (with exactly the same key/item/expiration to store), memcached gladly stores it. Further get requests verify that the item is correctly stored. According to my logs, this happens intermittently, and I haven't been able to correlate those transient failures with my slab stats.

*Observation & Question 1:*

Q1: Does my issue arise because, when the initial set request arrives, memcached has to run the slab automover to produce a free slab (maybe two slabs, since the item is larger than 512KiB) to accommodate the request? This is my hunch --- I have yet to do a quick |stats| dump at the exact moment of a set failure to confirm it. But I have seen [slab_reassign_busy_items = 10K] and [slabs_moved = 16.9K] in my |stats| dumps, which means the slab automover must have been triggered at some point during memcached's lifetime. This leads to my next questions:

*Observation & Questions 2 & 3:*

Q2: When the slab automover is running, could it block a large-item set request, as in my case above?

Q3: Why would memcached favor triggering the slab automover over allocating new memory when there is still host memory available?
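To make the "(maybe two slabs)" part of Q1 concrete, here's a rough sketch of the chunk math I have in mind. The 512KiB cap is my understanding of the default slab_chunk_max under -o modern; the 96-byte base class size is just an illustrative assumption (the real smallest class depends on item overhead and the -n option).

```python
# Sketch: how a growth factor chains slab class sizes, and why my
# 760KiB-930KiB items would span more than one 512KiB chunk.
# BASE is an illustrative assumption, not memcached's actual value.

GROWTH_FACTOR = 1.16
CHUNK_MAX = 512 * 1024          # assumed slab_chunk_max with -o modern
BASE = 96                       # assumed smallest chunk size

def slab_class_sizes(base=BASE, factor=GROWTH_FACTOR, cap=CHUNK_MAX):
    """Generate chunk sizes by repeatedly applying the growth factor."""
    sizes = []
    size = base
    while size <= cap:
        sizes.append(size)
        size = int(size * factor)
    return sizes

def chunks_needed(item_bytes, chunk_max=CHUNK_MAX):
    """Items larger than chunk_max get stored as a chain of chunks."""
    return -(-item_bytes // chunk_max)   # ceiling division

print(chunks_needed(930 * 1024))   # -> 2
print(chunks_needed(760 * 1024))   # -> 2
```

So if that's right, every one of my problem items needs two max-size chunks, which is why I suspect the automover has to come up with more than one free page at once.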
According to the stats dumps, my memcached instance has [total_malloced = 793MiB] and a footprint of [bytes = 392.33MiB] --- both fall far short of [limit_maxbytes = 2900MiB]. Furthermore, nothing has been evicted, since I have [evictions = 0]. (And the host system has plenty of free physical memory, per |free -m|.)

I would expect allocating new memory to be faster (*way* faster, actually) than triggering the slab automover to reassign slabs to accommodate the incoming set request, and that allocating memory would allow the initial set request to be served immediately. In addition, if the slab automover just happens to be running when the large-item set request arrives, and the answer to Q2 is "yes"... can we make it not block when there's still host memory available?

I'm kind of out of clues here, and I might actually be on the wrong route in my investigation. Any insight is appreciated, and it'd be great if I could get rid of those set failures without having to summon a dinosaur. For example, would disabling the slab automover be an acceptable band-aid fix? (I would then launch the manual mover (mc_slab_mover) when I know traffic is relatively light.)

Thanks a lot.

p.s. While "retry this set request at a later time" works (anecdotally), I don't want to implement a retry mechanism on the client side, since 1) the "later time" is probably non-deterministic, and 2) I don't have a readily available construct to decouple such a retry from the rest of my task, so having to retry would unnecessarily block the client side.

--
You received this message because you are subscribed to the Google Groups "memcached" group.
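For what it's worth, when I do manage to capture a |stats| dump at the exact moment of a failure, I plan to parse it with something like the sketch below so I can diff the slab state against a healthy baseline. The sample response fragment here is a made-up placeholder, not my real dump; in real use I'd open a socket to the instance, send "stats\r\n", and read until the "END" line.

```python
# Sketch: parse memcached's text-protocol "stats" / "stats slabs" output
# into a dict, so a failed set can be logged next to slab state.
# "sample" is a fabricated fragment for illustration only.

def parse_stats(text):
    """Turn 'STAT <name> <value>' lines into a dict (values kept as strings)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

sample = """STAT slab_reassign_busy_items 10000
STAT slabs_moved 16900
STAT evictions 0
END"""

stats = parse_stats(sample)
print(stats["slabs_moved"])   # -> 16900
```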
Attachments: stats_0207, stats_slabs_0207