Hi, We are using "memcache" extensively in our system. It is a mission-critical part of our setup. Each memcached instance serves ~4300 requests per second.
Time to time we were experiencing connection problems related to memcache. It was sometimes client problems. Recently, after investigations we eliminated client problems by using "*mc_conn_tester.pl*" utility. In problematic cases the output of the script looks like the following: >> cat log_one_second_23.09.2013_03_01_00 > Fail: (timeout: 1) (elapsed: 0.99706697) (conn: 0.00038695) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00011182) (conn: 0.00089383) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99846196) (conn: 0.00037885) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99718308) (conn: 0.00032306) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99842286) (conn: 0.00834680) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99408889) (conn: 0.00081396) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99712896) (conn: 0.00053000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99435019) (conn: 0.00065017) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99819398) (conn: 0.00043702) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99858499) (conn: 0.00055599) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99473310) (conn: 0.00081706) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99846315) (conn: 0.00115013) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99548602) (conn: 0.00089598) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99935794) (conn: 0.00079107) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99994612) (conn: 0.00031209) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99520397) (conn: 0.00059295) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99979401) (conn: 0.00081515) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99977612) (conn: 0.00092006) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99882603) (conn: 0.00031114) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99979401) (conn: 0.00097299) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99976993) (conn: 0.00079584) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99978495) (conn: 0.00076199) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99978709) (conn: 0.00082493) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99823999) (conn: 0.00079703) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99691510) (conn: 0.00033808) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99984884) (conn: 0.00076079) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99980712) (conn: 0.00074601) (set: > 0.00000000) (get: 0.00000000) *Looks like set commands timedout.* ONE MIN LATER... cat log_one_second_23.09.2013_03_02_00 > Fail: (timeout: 1) (elapsed: 0.99667096) (conn: 0.00094295) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99732089) (conn: 0.00030780) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99830413) (conn: 0.00040698) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99939990) (conn: 0.00093007) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99816799) (conn: 0.00104308) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99803996) (conn: 0.00046182) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99824309) (conn: 0.00037694) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99974799) (conn: 0.00076199) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99993491) (conn: 0.00064301) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99674511) (conn: 0.00061512) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99985600) (conn: 0.00084496) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99823785) (conn: 0.00057387) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99989510) (conn: 0.00076413) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99977493) (conn: 0.00072813) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99985600) (conn: 0.00085998) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99856400) (conn: 0.00030208) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99905896) (conn: 0.00035596) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99979901) (conn: 0.00030088) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99950194) (conn: 0.00077200) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99942899) (conn: 0.00032091) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99972081)* (conn: 0.00000000) *(set: > 0.00000000) (get: 0.00000000) => unable to connect to server > Fail: (timeout: 1) (elapsed: 0.99969196) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99975896) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99968505) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99983907) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00000811) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99965501) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99987078) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99946094) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99982381) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99980092) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99978995) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99960685) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99984312) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00022602) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99945903) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99983001) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99976397) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99992895) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99990511) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99949980) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99972081) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99978590) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99515796) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99932599) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99973083) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00017118) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99933004) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99977708) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99988508) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99985194) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99997783) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99988699) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99945092) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99429011) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99931216) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00022388) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99930620) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99989700) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99989390) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99956489) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 1.00017381) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99927402) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99990702) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99962592) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99996781) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99967813) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99986005) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99977589) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99988198) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99939990) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) > Fail: (timeout: 1) (elapsed: 0.99980521) (conn: 0.00000000) (set: > 0.00000000) (get: 0.00000000) *After set-cmds timedout, in a few minutes, timeouts occur while connecting.* * It goes on 'till we restart the server.* In addition, only suspicious memache log we have (with verbosity -v) is the following lines. (Thousands of them) > Failed to write, and not due to blocking: Broken pipe > Failed to write, and not due to blocking: Broken pipe > Failed to write, and not due to blocking: Broken pipe > Failed to write, and not due to blocking: Broken pipe > Failed to write, and not due to blocking: Broken pipe * * *Problem: All client connections lost at the same time and do not recover until memcache server restart.* * * *FACTS:* *Server Version: *1.4.15 *Client Version*: spymemcahced-2-7 *OS&Hardware: *5.10 Generic_144488-06 sun4u sparc SUNW,SPARC-Enterpris *operations (get,set,delete)/sec: *4263 *stats settings* > STAT maxbytes 0 > STAT maxconns 1024 > STAT tcpport 11211 > STAT udpport 11211 > STAT inter NULL > STAT verbosity 1 > STAT oldest 0 > STAT evictions on > STAT domain_socket NULL > STAT umask 700 > STAT growth_factor 1.25 > STAT chunk_size 48 > STAT num_threads 28 > STAT num_threads_per_udp 28 > STAT stat_key_prefix : > STAT detail_enabled no > STAT reqs_per_event 20 > STAT cas_enabled yes > STAT tcp_backlog 1024 > STAT binding_protocol auto-negotiate > STAT auth_enabled_sasl no > STAT item_size_max 1048576 > STAT maxconns_fast no > STAT hashpower_init 0 > STAT slab_reassign no > STAT slab_automove 0 > END *stats right before timeouts start:* > delete_hits : 57740453 > bytes : 606062029 > total_items : 201223114 > rusage_system : 136169.687781 > touch_misses : 0 > cmd_touch : 0 > listen_disabled_num : 0 > auth_errors : 0 > evictions : 0 > version : 1.4.15 > pointer_size : 64 > time : 1379894522 > incr_hits : 0 > threads : 28 > expired_unfetched : 375261 > limit_maxbytes : 4294967296 > hash_is_expanding : 0 > bytes_read : 201905411280 > curr_connections : 292 > get_misses : 2113388227 > reclaimed : 417327 > bytes_written : 1579019546599 > hash_power_level : 20 > connection_structures : 294 > cas_hits : 0 > delete_misses : 75083639 > total_connections : 8671074 > rusage_user : 47476.197067 > cmd_flush : 0 > libevent : 2.0.13-stable > uptime : 1185883 > reserved_fds : 140 > touch_hits : 0 > cas_badval : 0 > pid : 26387 > get_hits : 3206648217 > curr_items : 1278371 > cas_misses : 0 > accepting_conns : 1 > evicted_unfetched : 0 > cmd_get : 5320036444 > cmd_set : 202158311 > auth_cmds : 0 > incr_misses : 0 > hash_bytes : 8388608 > decr_misses : 0 > decr_hits : 0 > conn_yields : 16438 What do you think the problem would be? What other piece of information we need to gather? Kind Regards, Doruk -- --- You received this message because you are subscribed to the Google Groups "memcached" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/groups/opt_out.
