We run memcached 1.4.2 on our cluster, but it goes OOM every few days, so we are trying to migrate to a newer build. The release notes for 1.4.15 specifically say that it fixes some OOM cases, so we tried 1.4.15.

With 1.4.15 we noticed that roughly every 40-60 minutes, sets fail in bursts. On further digging we found that each burst coincides with a spike in user CPU time and a spike in open connections. Even on a box that isn't serving any production traffic, user CPU still spikes at roughly the same frequency, though the spikes are much shorter. During one of the spikes, top showed that it was in fact the memcached process that was hogging the CPU. This behavior wasn't observed with the previous binary.

Some differences between the old binary and the new binary:
* old binary used libevent 1.4.2 whereas new one uses libevent 2.0.16
* old binary was running on Ubuntu 10.04 whereas new one is running on 12.04

Some more details about new binary:

stats
STAT pid 2727
STAT uptime 74268
STAT time 1364419469
STAT version 1.4.15
STAT libevent 2.0.16-stable
STAT pointer_size 64
STAT rusage_user 16275.537157
STAT rusage_system 20872.252434
STAT curr_connections 33
STAT total_connections 270918
STAT connection_structures 3712
STAT reserved_fds 20
STAT cmd_get 2216135698
STAT cmd_set 161257323
STAT cmd_flush 0
STAT cmd_touch 0
STAT get_hits 1970822534
STAT get_misses 245313164
STAT delete_misses 2615317
STAT delete_hits 4184410
STAT incr_misses 366964
STAT incr_hits 3804454
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 21315
STAT cas_hits 81561392
STAT cas_badval 1845457
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 114764483242
STAT bytes_written 396446726727
STAT limit_maxbytes 29360128000
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 26
STAT hash_bytes 536870912
STAT hash_is_expanding 0
STAT bytes 10653107460
STAT curr_items 61033194
STAT total_items 163080654
STAT expired_unfetched 57
STAT evicted_unfetched 0
STAT evictions 0
STAT reclaimed 72
END

stats settings
STAT maxbytes 3590324224
STAT maxconns 100000
STAT tcpport 11211
STAT udpport 11211
STAT inter 0.0.0.0
STAT verbosity 0
STAT oldest 0
STAT evictions on
STAT domain_socket NULL
STAT umask 700
STAT growth_factor 1.25
STAT chunk_size 48
STAT num_threads 4
STAT num_threads_per_udp 4
STAT stat_key_prefix :
STAT detail_enabled no
STAT reqs_per_event 20
STAT cas_enabled yes
STAT tcp_backlog 1024
STAT binding_protocol auto-negotiate
STAT auth_enabled_sasl no
STAT item_size_max 1048576
STAT maxconns_fast no
STAT hashpower_init 0
STAT slab_reassign no
STAT slab_automove 0
END

We get around 10K operations per second (get + multi get + set) per server.
root@mc20:~# ps aux | grep mem | grep -v grep
nobody 2727 49.4 38.5 13935812 13508096 ? Ssl 00:46 619:36 /usr/bin/memcached -m 28000 -p 11211 -u nobody -l 0.0.0.0 -d -c 100000

User CPU spikes every 40-60 minutes:
<https://lh6.googleusercontent.com/-thu0cArZjIE/UVNkiUvVd2I/AAAAAAAABF8/yEsM0ni5mDI/s1600/user_cpu.gif>

Open connections seem to spike at the same time:
<https://lh4.googleusercontent.com/-cm_PVsS1S0A/UVNkzaiP-4I/AAAAAAAABGE/-NohVkmdOJA/s1600/open_connections.gif>

User CPU graph for a non-production server, showing spikes at a similar frequency but much shorter:
<https://lh3.googleusercontent.com/-XTfcDznL3Us/UVNlVztQ_mI/AAAAAAAABGM/be-rES0mXps/s1600/non_prod_user_cpu.gif>

I captured an strace during one of the spikes but found nothing suspicious in it. I can provide it if it would be helpful. I also have the output of ls -l /proc/$(pidof memcached)/fd from a spike.

Is there a background thread that does some heavy-duty work every so many minutes?
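
If finer-grained data would help, the spikes could be sampled directly from the stats output rather than read off the graphs. Below is a minimal Python sketch of that idea, not something we run today. The helper names (parse_stats, cpu_between, fetch_stats) are hypothetical and made up for illustration; the only things taken from memcached itself are the text-protocol "stats" command and the STAT fields shown above (time, rusage_user, rusage_system, curr_connections). It assumes the server is reachable over TCP on the port given.

```python
import socket


def parse_stats(text):
    """Parse memcached 'stats' text output into a dict of name -> string value."""
    stats = {}
    for line in text.splitlines():
        # Each data line looks like "STAT <name> <value>"; "END" is skipped.
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def cpu_between(prev, curr):
    """Return (user, system) CPU use, as a fraction of one core,
    between two parsed stats samples."""
    # STAT time has one-second resolution, so samples should be
    # spaced several seconds apart.
    elapsed = int(curr["time"]) - int(prev["time"])
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    user = (float(curr["rusage_user"]) - float(prev["rusage_user"])) / elapsed
    system = (float(curr["rusage_system"]) - float(prev["rusage_system"])) / elapsed
    return user, system


def fetch_stats(host="127.0.0.1", port=11211, timeout=2.0):
    """Send 'stats' over the text protocol and return the parsed reply.
    Host and port are assumptions; adjust for the server being checked."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        # The reply is terminated by a line containing only "END".
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", "replace"))
```

Calling fetch_stats every 10 seconds or so and logging cpu_between(previous, current) next to curr_connections should show whether the user CPU rises before or after the connection count does. As a sanity check on the numbers above, (rusage_user + rusage_system) / uptime comes to roughly 0.5 of a core on average, which is in line with the 49.4% CPU that ps reports.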
