Hello all,
I help manage a very large site (in Alexa's top 70).
Unfortunately due to NDA I can't reveal its name.
We get about 8 million unique visits per day and currently memcached is
the only thing that keeps the site going.
We have two memcached servers and currently use them this way:
-look in memcached1 for value
-if not found -> do a query - write result in memcached1 and 2
-if memcached1 is not available -> look in memcached2 for values
I know that's not the proper way to do this, but at the time we didn't
have that many spare servers.
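
The three steps above can be sketched roughly as follows. This is only
an illustration of the described flow, not our production code: the
DictCache class is a hypothetical in-memory stand-in for a memcached
client, and cached_lookup and run_query are names made up for this
example.

```python
class DictCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self, available=True):
        self.data = {}
        self.available = available

    def get(self, key):
        if not self.available:
            raise ConnectionError("cache unavailable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise ConnectionError("cache unavailable")
        self.data[key] = value


def cached_lookup(key, primary, secondary, run_query):
    """Read from memcached1, fall back to memcached2 only if memcached1 is down."""
    try:
        value = primary.get(key)
    except ConnectionError:
        try:
            value = secondary.get(key)
        except ConnectionError:
            value = None
    if value is not None:
        return value

    # Cache miss: run the query and write the result to both servers.
    value = run_query(key)
    for cache in (primary, secondary):
        try:
            cache.set(key, value)
        except ConnectionError:
            pass  # a dead cache server must not break the request
    return value
```

Note that with this scheme every miss on memcached1 goes straight to
the DB, even if memcached2 still holds the value.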
Now the problem is this.
During normal operation, these are memcached stats:
STAT pid 7995
STAT uptime 4984
STAT time 1254317043
STAT version 1.4.1
STAT pointer_size 32
STAT rusage_user 334.184196
STAT rusage_system 915.517820
STAT curr_connections 529
STAT total_connections 4098258
STAT connection_structures 727
STAT cmd_get 55994089
STAT cmd_set 2103657
STAT cmd_flush 0
STAT get_hits 51099680
STAT get_misses 4894409
STAT delete_misses 147084
STAT delete_hits 916142
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT bytes_read 2259781347
STAT bytes_written 30945451617
STAT limit_maxbytes 0
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 9
STAT conn_yields 0
STAT bytes 235845560
STAT curr_items 175108
STAT total_items 2103657
STAT evictions 0
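
For comparing these numbers against a snapshot taken during the bad
state, a small helper like the one below may be useful. It is a sketch
only: parse_stats and summarize are hypothetical names, and SAMPLE
repeats a few lines from the output above.

```python
def parse_stats(text):
    """Turn 'STAT name value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(stats):
    """Derive hit ratio and per-second rates from the raw counters."""
    gets = int(stats["cmd_get"])
    hits = int(stats["get_hits"])
    uptime = int(stats["uptime"])
    return {
        "hit_ratio": hits / gets if gets else 0.0,
        "gets_per_sec": gets / uptime if uptime else 0.0,
    }


SAMPLE = """STAT uptime 4984
STAT cmd_get 55994089
STAT get_hits 51099680
STAT get_misses 4894409"""

summary = summarize(parse_stats(SAMPLE))
print("hit ratio: %.3f" % summary["hit_ratio"])
print("gets/sec:  %.0f" % summary["gets_per_sec"])
```

For the normal-operation stats above this gives a hit ratio of roughly
0.91, so a sharp drop in that figure during an incident would be the
first thing to look for.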

However, from time to time (anywhere from once a week to twice a day),
the connections to the server jump to about 3000 and stay that way. The
DB then gets flooded with connections: those that can connect run their
queries, and the rest stay in TIME_WAIT. During normal operation the DB
(MySQL) has about 30 active connections at a given time and about 100
in TIME_WAIT.
When the memcached problem appears, the connections to the DB jump to
2000 (the maximum) and more than 40K are waiting.
Restarting memcached (both servers) solves the problem: gradually, over
about 10 minutes, the most frequent queries are cached again and
everything gets quiet. I wrote a script that sets a key whose value is
the current time with an expire time of 50 seconds, then checks 60
seconds later whether that key is still there. 999 times out of 1000
the key has expired, but sometimes the script reports that the key is
still there.
I also tried setting 10000 unique keys with an expire time of 50
seconds and fetching all of them 60 seconds later. All had expired as
expected, but I have not yet run that test while the memcached servers
are in "strange behaviour mode".
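
The expiry check is roughly the following. This is a simplified sketch,
not the actual script: key_survived_expiry is a made-up name, the
client is assumed to be any object with set(key, value, expire) and
get(key) methods, and FakeClient is a hypothetical in-memory stand-in
with a manual clock so the logic can be dry-run without waiting.

```python
import time


def key_survived_expiry(client, key, ttl=50, wait=60, sleep=time.sleep):
    """Set a key with a TTL, wait past it, report whether it is still readable."""
    client.set(key, str(int(time.time())), ttl)
    sleep(wait)
    return client.get(key) is not None


class FakeClient:
    """Hypothetical in-memory client with a manual clock, for dry runs only."""

    def __init__(self):
        self.now = 0
        self.items = {}

    def set(self, key, value, expire):
        # Store the value together with its absolute expiry deadline.
        self.items[key] = (value, self.now + expire)

    def get(self, key):
        value, deadline = self.items.get(key, (None, 0))
        return value if self.now < deadline else None

    def advance(self, seconds):
        self.now += seconds


fake = FakeClient()
# Healthy behaviour: a 50 second key is gone after 60 seconds.
print(key_survived_expiry(fake, "probe", ttl=50, wait=60, sleep=fake.advance))
```

Against a real server, a True result with the default arguments is the
"key is still there" case described above.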
Do you have any ideas/suggestions - at least how to diagnose the
problem, or how to reproduce it and what could be the problem?
Currently I have a cronjob that restarts one server every odd hour and
the other every even hour, but that's not pretty.
We are in the process of setting up around 10 memcached servers in a
standard server array; hopefully that will solve the problem.

Thanks in advance and sorry for the long post,
Cheers
