Hi,

We are using "memcache" extensively in our system. It is a mission-critical 
part of our setup. Each memcached instance serves ~4300 requests per second.

Time to time we were experiencing connection problems related to memcache. 
It was sometimes client problems. Recently, after investigations we 
eliminated client problems by using "*mc_conn_tester.pl*" utility.
In problematic  cases the output of the script looks like the following:

>> cat log_one_second_23.09.2013_03_01_00
> Fail: (timeout: 1) (elapsed: 0.99706697) (conn: 0.00038695) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00011182) (conn: 0.00089383) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99846196) (conn: 0.00037885) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99718308) (conn: 0.00032306) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99842286) (conn: 0.00834680) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99408889) (conn: 0.00081396) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99712896) (conn: 0.00053000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99435019) (conn: 0.00065017) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99819398) (conn: 0.00043702) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99858499) (conn: 0.00055599) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99473310) (conn: 0.00081706) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99846315) (conn: 0.00115013) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99548602) (conn: 0.00089598) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99935794) (conn: 0.00079107) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99994612) (conn: 0.00031209) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99520397) (conn: 0.00059295) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99979401) (conn: 0.00081515) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99977612) (conn: 0.00092006) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99882603) (conn: 0.00031114) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99979401) (conn: 0.00097299) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99976993) (conn: 0.00079584) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99978495) (conn: 0.00076199) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99978709) (conn: 0.00082493) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99823999) (conn: 0.00079703) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99691510) (conn: 0.00033808) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99984884) (conn: 0.00076079) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99980712) (conn: 0.00074601) (set: 
> 0.00000000) (get: 0.00000000)

*Looks like set commands timedout.*

ONE MIN LATER...

cat log_one_second_23.09.2013_03_02_00
> Fail: (timeout: 1) (elapsed: 0.99667096) (conn: 0.00094295) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99732089) (conn: 0.00030780) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99830413) (conn: 0.00040698) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99939990) (conn: 0.00093007) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99816799) (conn: 0.00104308) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99803996) (conn: 0.00046182) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99824309) (conn: 0.00037694) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99974799) (conn: 0.00076199) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99993491) (conn: 0.00064301) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99674511) (conn: 0.00061512) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99985600) (conn: 0.00084496) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99823785) (conn: 0.00057387) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99989510) (conn: 0.00076413) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99977493) (conn: 0.00072813) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99985600) (conn: 0.00085998) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99856400) (conn: 0.00030208) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99905896) (conn: 0.00035596) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99979901) (conn: 0.00030088) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99950194) (conn: 0.00077200) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99942899) (conn: 0.00032091) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99972081)* (conn: 0.00000000) *(set: 
> 0.00000000) (get: 0.00000000) => unable to connect to server
> Fail: (timeout: 1) (elapsed: 0.99969196) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99975896) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99968505) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99983907) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00000811) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99965501) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99987078) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99946094) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99982381) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99980092) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99978995) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99960685) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99984312) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00022602) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99945903) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99983001) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99976397) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99992895) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99990511) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99949980) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99972081) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99978590) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99515796) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99932599) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99973083) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00017118) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99933004) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99977708) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99988508) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99985194) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99997783) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99988699) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99945092) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99429011) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99931216) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00022388) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99930620) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99989700) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99989390) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99956489) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 1.00017381) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99927402) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99990702) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99962592) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99996781) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99967813) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99986005) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99977589) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99988198) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99939990) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)
> Fail: (timeout: 1) (elapsed: 0.99980521) (conn: 0.00000000) (set: 
> 0.00000000) (get: 0.00000000)

 *After set-cmds timedout, in a few minutes, timeouts occur while 
connecting.*
* It goes on 'till we restart the server.*

In addition, only suspicious memache log we have  (with verbosity -v) is 
the following lines. (Thousands of them)
 

> Failed to write, and not due to blocking: Broken pipe
> Failed to write, and not due to blocking: Broken pipe
> Failed to write, and not due to blocking: Broken pipe
> Failed to write, and not due to blocking: Broken pipe
> Failed to write, and not due to blocking: Broken pipe


*
*
*Problem: All client connections lost at the same time and do not recover 
until memcache server restart.*
*
*
*FACTS:*
*Server Version: *1.4.15
*Client Version*: spymemcahced-2-7
*OS&Hardware: *5.10 Generic_144488-06 sun4u sparc SUNW,SPARC-Enterpris
*operations (get,set,delete)/sec: *4263
*stats settings*

> STAT maxbytes 0
> STAT maxconns 1024
> STAT tcpport 11211
> STAT udpport 11211
> STAT inter NULL
> STAT verbosity 1
> STAT oldest 0
> STAT evictions on
> STAT domain_socket NULL
> STAT umask 700
> STAT growth_factor 1.25
> STAT chunk_size 48
> STAT num_threads 28
> STAT num_threads_per_udp 28
> STAT stat_key_prefix :
> STAT detail_enabled no
> STAT reqs_per_event 20
> STAT cas_enabled yes
> STAT tcp_backlog 1024
> STAT binding_protocol auto-negotiate
> STAT auth_enabled_sasl no
> STAT item_size_max 1048576
> STAT maxconns_fast no
> STAT hashpower_init 0
> STAT slab_reassign no
> STAT slab_automove 0
> END


*stats right before timeouts start:*

> delete_hits : 57740453 
> bytes : 606062029 
> total_items : 201223114 
> rusage_system : 136169.687781 
> touch_misses : 0 
> cmd_touch : 0 
> listen_disabled_num : 0 
> auth_errors : 0 
> evictions : 0 
> version : 1.4.15 
> pointer_size : 64 
> time : 1379894522 
> incr_hits : 0 
> threads : 28 
> expired_unfetched : 375261 
> limit_maxbytes : 4294967296 
> hash_is_expanding : 0 
> bytes_read : 201905411280 
> curr_connections : 292 
> get_misses : 2113388227 
> reclaimed : 417327 
> bytes_written : 1579019546599 
> hash_power_level : 20 
> connection_structures : 294 
> cas_hits : 0 
> delete_misses : 75083639 
> total_connections : 8671074 
> rusage_user : 47476.197067 
> cmd_flush : 0 
> libevent : 2.0.13-stable 
> uptime : 1185883 
> reserved_fds : 140 
> touch_hits : 0 
> cas_badval : 0 
> pid : 26387 
> get_hits : 3206648217 
> curr_items : 1278371 
> cas_misses : 0 
> accepting_conns : 1 
> evicted_unfetched : 0 
> cmd_get : 5320036444 
> cmd_set : 202158311 
> auth_cmds : 0 
> incr_misses : 0 
> hash_bytes : 8388608 
> decr_misses : 0 
> decr_hits : 0 
> conn_yields : 16438 


What do you think the problem would be?
What other piece of information we need to gather?

Kind Regards,
Doruk 















-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to