We run 1.4.2 on our cluster, but it goes OOM every few days, so we are 
trying to migrate to a newer build. The release notes for 1.4.15 specifically 
say that it fixes some OOM cases, so we tried 1.4.15. We noticed that roughly 
every 40-60 minutes, sets fail in bursts. On further digging we found 
that each burst coincides with a spike in user CPU time and a spike in 
open connections. Even when a box isn't serving any production 
traffic, user CPU still spikes at roughly the same frequency (though the 
spikes are much shorter). During one of the spikes, top showed that it was 
in fact the memcached process hogging the CPU. This behavior wasn't observed 
with the previous binary. Some differences between the old and new binaries:

* old binary used libevent 1.4.2 whereas new one uses libevent 2.0.16
* old binary was running on Ubuntu 10.04 whereas new one is running on 12.04

Some more details about the new binary:

stats
STAT pid 2727
STAT uptime 74268
STAT time 1364419469
STAT version 1.4.15
STAT libevent 2.0.16-stable
STAT pointer_size 64
STAT rusage_user 16275.537157
STAT rusage_system 20872.252434
STAT curr_connections 33
STAT total_connections 270918
STAT connection_structures 3712
STAT reserved_fds 20
STAT cmd_get 2216135698
STAT cmd_set 161257323
STAT cmd_flush 0
STAT cmd_touch 0
STAT get_hits 1970822534
STAT get_misses 245313164
STAT delete_misses 2615317
STAT delete_hits 4184410
STAT incr_misses 366964
STAT incr_hits 3804454
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 21315
STAT cas_hits 81561392
STAT cas_badval 1845457
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 114764483242
STAT bytes_written 396446726727
STAT limit_maxbytes 29360128000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 26
STAT hash_bytes 536870912
STAT hash_is_expanding 0
STAT bytes 10653107460
STAT curr_items 61033194
STAT total_items 163080654
STAT expired_unfetched 57
STAT evicted_unfetched 0
STAT evictions 0
STAT reclaimed 72
END
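
To make the counters above easier to compare with the graphs, here is a minimal Python sketch that parses `stats` output and derives lifetime-average CPU fractions and the get hit ratio. The helper names (`parse_stats`, `summarize`) are hypothetical, and `SAMPLE` holds only the handful of lines from the output above that the calculation needs. These are averages over the whole uptime, so they will not show the spikes themselves.

```python
def parse_stats(text):
    """Parse 'STAT <name> <value>' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2].strip()
    return stats


def summarize(stats):
    """Derive lifetime averages from the raw counters."""
    uptime = float(stats["uptime"])
    gets = int(stats["cmd_get"])
    return {
        # CPU seconds consumed per second of uptime (1.0 == one full core).
        "user_cpu_fraction": float(stats["rusage_user"]) / uptime,
        "system_cpu_fraction": float(stats["rusage_system"]) / uptime,
        "get_hit_ratio": int(stats["get_hits"]) / gets if gets else 0.0,
    }


# Subset of the `stats` output shown above.
SAMPLE = """\
STAT uptime 74268
STAT rusage_user 16275.537157
STAT rusage_system 20872.252434
STAT cmd_get 2216135698
STAT get_hits 1970822534
STAT get_misses 245313164
END
"""

summary = summarize(parse_stats(SAMPLE))
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

On these numbers, user plus system CPU averages out to roughly 0.5 of a core over the process lifetime, which is in line with the 49.4 %CPU that ps reports further down.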


stats settings
STAT maxbytes 3590324224
STAT maxconns 100000
STAT tcpport 11211
STAT udpport 11211
STAT inter 0.0.0.0
STAT verbosity 0
STAT oldest 0
STAT evictions on
STAT domain_socket NULL
STAT umask 700
STAT growth_factor 1.25
STAT chunk_size 48
STAT num_threads 4
STAT num_threads_per_udp 4
STAT stat_key_prefix :
STAT detail_enabled no
STAT reqs_per_event 20
STAT cas_enabled yes
STAT tcp_backlog 1024
STAT binding_protocol auto-negotiate
STAT auth_enabled_sasl no
STAT item_size_max 1048576
STAT maxconns_fast no
STAT hashpower_init 0
STAT slab_reassign no
STAT slab_automove 0
END
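
Note that `maxbytes` here (3590324224) does not match `limit_maxbytes` in the `stats` output (29360128000). The arithmetic below suggests that the settings value is the same 28000 MB limit reduced modulo 2^32. That it is a 32-bit truncation at print time is an assumption, not something confirmed from the memcached source, and it may be unrelated to the CPU spikes.

```python
# Values copied from the two outputs above.
limit_maxbytes = 29360128000      # from `stats`
settings_maxbytes = 3590324224    # from `stats settings`

# -m 28000 on the command line means 28000 megabytes.
requested = 28000 * 1024 * 1024


def wrap32(n):
    """Reduce n modulo 2**32, as an unsigned 32-bit field would."""
    return n % (1 << 32)


print(requested == limit_maxbytes)
print(wrap32(limit_maxbytes) == settings_maxbytes)
```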

We get around 10K operations per second (get + multi get + set) per server.

root@mc20:~# ps aux | grep mem | grep -v grep
nobody    2727 49.4 38.5 13935812 13508096 ?   Ssl  00:46 619:36 /usr/bin/memcached -m 28000 -p 11211 -u nobody -l 0.0.0.0 -d -c 100000

User cpu spikes every 40-60 minutes:

<https://lh6.googleusercontent.com/-thu0cArZjIE/UVNkiUvVd2I/AAAAAAAABF8/yEsM0ni5mDI/s1600/user_cpu.gif>

Open connections seem to spike at same time:

<https://lh4.googleusercontent.com/-cm_PVsS1S0A/UVNkzaiP-4I/AAAAAAAABGE/-NohVkmdOJA/s1600/open_connections.gif>

User CPU graph for a non-production server, with spikes at a similar 
frequency but much shorter:

<https://lh3.googleusercontent.com/-XTfcDznL3Us/UVNlVztQ_mI/AAAAAAAABGM/be-rES0mXps/s1600/non_prod_user_cpu.gif>

I captured an strace during one of the spikes but found nothing suspicious 
in it. I can provide it if that would be helpful. I also have the output of 
ls -l /proc/$(pidof memcached)/fd from a spike.

Is there a background thread that does heavy-duty work every so many 
minutes?
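
For anyone who wants to line up the spikes with memcached's own counters rather than host-level graphs, here is a rough Python sketch that polls `stats` over the text protocol and flags intervals where user CPU jumps. The function names, the 10-second interval, and the 0.8 threshold are all illustrative assumptions, and the host and port are the defaults from the settings above.

```python
import socket
import time


def fetch_stats(host="127.0.0.1", port=11211, timeout=5.0):
    """Send the text-protocol `stats` command and return the reply as a dict."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode("ascii", "replace").splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def user_cpu_rates(samples):
    """samples: list of (wall_time, rusage_user, curr_connections) tuples.

    Returns one (wall_time, user_cpu_fraction, curr_connections) tuple per
    interval between consecutive samples.
    """
    rates = []
    for (t0, u0, _), (t1, u1, c1) in zip(samples, samples[1:]):
        if t1 > t0:
            rates.append((t1, (u1 - u0) / (t1 - t0), c1))
    return rates


def find_spikes(samples, threshold):
    """Keep only the intervals whose user CPU fraction exceeds threshold."""
    return [r for r in user_cpu_rates(samples) if r[1] > threshold]


def poll(interval=10.0, threshold=0.8):
    """Poll forever, printing a line whenever an interval looks like a spike."""
    samples = []
    while True:
        s = fetch_stats()
        samples.append(
            (time.time(), float(s["rusage_user"]), int(s["curr_connections"]))
        )
        samples = samples[-2:]  # only the latest interval is needed
        for t, rate, conns in find_spikes(samples, threshold):
            print(f"{time.ctime(t)} user_cpu={rate:.2f} curr_connections={conns}")
        time.sleep(interval)
```

Calling `poll()` on the memcached box starts the loop. Because `rusage_user` is cumulative, the per-interval delta is what shows a spike, and with 4 worker threads the fraction can exceed 1.0.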
