Hello all,

I help manage a very large site (in Alexa's top 70). Unfortunately, due to an NDA I can't reveal its name. We get about 8 million unique visits per day, and currently memcached is the only thing that keeps the site going.

We have two memcached servers and currently use them in the following way:

- look in memcached1 for the value
- if not found -> do a query and write the result to memcached1 and memcached2
- if memcached1 is not available -> look in memcached2 for the value

I know that's not the proper way to do this, but at the time we didn't have that many spare servers.

Now the problem is this. During normal operation, these are the memcached stats:

STAT pid 7995
STAT uptime 4984
STAT time 1254317043
STAT version 1.4.1
STAT pointer_size 32
STAT rusage_user 334.184196
STAT rusage_system 915.517820
STAT curr_connections 529
STAT total_connections 4098258
STAT connection_structures 727
STAT cmd_get 55994089
STAT cmd_set 2103657
STAT cmd_flush 0
STAT get_hits 51099680
STAT get_misses 4894409
STAT delete_misses 147084
STAT delete_hits 916142
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT bytes_read 2259781347
STAT bytes_written 30945451617
STAT limit_maxbytes 0
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 9
STAT conn_yields 0
STAT bytes 235845560
STAT curr_items 175108
STAT total_items 2103657
STAT evictions 0
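
To make the lookup order described above unambiguous, here is a rough sketch of it. This is not our production code: `FakeCache`, `cached_query` and `run_query` are hypothetical names, and the in-memory class only stands in for a real memcached client so the example runs on its own.

```python
class FakeCache:
    """In-memory stand-in for a memcached client (hypothetical, illustration only)."""

    def __init__(self, available=True):
        self.available = available
        self.data = {}

    def get(self, key):
        if not self.available:
            raise ConnectionError("cache server unreachable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise ConnectionError("cache server unreachable")
        self.data[key] = value


def cached_query(key, caches, run_query):
    """Read from the first reachable cache; on a miss, query the DB and write to all caches."""
    for cache in caches:
        try:
            value = cache.get(key)
        except ConnectionError:
            continue  # this server is down, fall through to the next one
        if value is not None:
            return value
        break  # a reachable server reported a miss: go to the database

    value = run_query(key)
    for cache in caches:
        try:
            cache.set(key, value)
        except ConnectionError:
            pass  # skip servers that are currently unreachable
    return value
```

Note that with this scheme every miss on the first reachable server goes straight to the database, even if the second server still holds the value.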

However, from time to time (once a week, sometimes twice a day), the connections to the server jump to about 3000 and stay that way. The DB gets flooded with connections: those that can connect do so and run their queries, and the rest stay in TIME_WAIT. During normal operation the DB (MySQL) has about 30 active connections at any given time and about 100 in TIME_WAIT. When the memcached problem appears, the connections to the DB jump to 2000 (the maximum) and more than 40K are waiting.

Restarting memcached (both servers) solves the problem: over about 10 minutes the most frequent queries get cached again and everything quiets down.

I made a script that sets a key whose value is the current time, with an expire time of 50 seconds, and 60 seconds later checks whether that key is still there. 999 times out of 1000 the key has expired, but sometimes I get a message from the script that the key is still there. I also tried setting 10000 unique keys with an expire time of 50 seconds and fetching all of them 60 seconds later. All of them had expired as expected, but I have not yet run that test while the memcached servers are in "strange-behaviour mode".

Do you have any ideas or suggestions, at least on how to diagnose the problem or how to reproduce it, and what the cause could be?

Currently I have a cronjob that restarts one server every odd hour and the other every even hour, but that's not pretty. We are in the process of setting up around 10 memcached servers in a standard server array; hopefully that will solve the problem.

Thanks in advance, and sorry for the long post.

Cheers
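
P.S. For anyone who wants to reproduce the expiry test, it can be sketched roughly as below. This is only an illustration under assumptions: `FakeTTLCache` and `find_unexpired` are hypothetical names, and the in-memory class stands in for a real memcached client so the sketch runs without a server.

```python
import time


class FakeTTLCache:
    """In-memory stand-in for a memcached client with expiry (hypothetical)."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.data = {}

    def set(self, key, value, ttl):
        # Store the value together with its absolute expiry time.
        self.data[key] = (value, self.clock() + ttl)

    def get(self, key):
        item = self.data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.data[key]
            return None
        return value


def find_unexpired(cache, n_keys, ttl, wait, sleep=time.sleep):
    """Set n_keys keys with the given ttl, wait, and return the keys still present."""
    keys = [f"probe:{i}" for i in range(n_keys)]
    for key in keys:
        cache.set(key, str(time.time()), ttl)
    sleep(wait)
    return [key for key in keys if cache.get(key) is not None]
```

With a 50 second ttl and a 60 second wait, a healthy cache should return an empty list; any key that comes back is one that outlived its expire time. Against a real server the fake class would be replaced by an actual client with equivalent `set` and `get` calls.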
