We have experienced this on a server where ntpd just decided to stop
working for days without us realizing it. I think starting at 0 is a sane
idea.
Brian.
--------
http://brian.moonspot.net/
On 6/30/10 1:59 PM, dormando wrote:
Yo,
Something I've been wondering about for a while is a class of bug that we
have with clock skew.
Memcached handles the expiration timer via:
- "process_started" timestamp in seconds, which gets initialized at
startup
- "current_time" which, once per second, gets set to the delta between
the current time and "process_started"
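A minimal C sketch of the scheme just described, for illustration only. The names process_started and current_time follow the email; the rel_time_t typedef and the helper functions are assumptions, not memcached's actual source. The wall-clock time is passed in as `now` so the logic can be exercised without a real clock; a server would pass time(NULL) from its once-per-second timer.

```c
#include <time.h>

typedef unsigned int rel_time_t;   /* assumed: unsigned seconds of uptime */

static time_t process_started;     /* wall-clock seconds at startup */
static rel_time_t current_time;    /* seconds since process_started */

/* Called once at startup. */
static void clock_init(time_t now) {
    process_started = now;
    current_time = 0;
}

/* Called once per second: current_time is the delta between the
 * wall clock and process_started. */
static void clock_update(time_t now) {
    current_time = (rel_time_t)(now - process_started);
}

/* A relative timeout of "N seconds from now". */
static rel_time_t expiry_from_now(rel_time_t seconds) {
    return current_time + seconds;
}
```

For example, clock_init(1000) followed by clock_update(1060) leaves current_time at 60, and an item stored with a 60-second timeout gets an expiry of 120.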
If your clock swings around wildly, there are a few situations where you
could end up with items expiring immediately or never, such as
current_time underflowing.
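To make the underflow concrete, here is a hypothetical sketch assuming current_time is a 32-bit unsigned integer and that an expiry of 0 means "never expires". If the clock is stepped back past process_started, the signed difference goes negative and wraps to a value near 4 billion: every existing item then looks expired, and items stored during the skew get expiries that a corrected clock won't reach for roughly 136 years.

```c
#include <time.h>

typedef unsigned int rel_time_t;   /* assumed 32-bit unsigned */

/* The per-second update: wall-clock delta since startup. */
static rel_time_t delta_since(time_t now, time_t started) {
    /* A negative difference wraps modulo 2^32 when converted. */
    return (rel_time_t)(now - started);
}

/* Hypothetical expiry check: 0 means "never expires". */
static int item_expired(rel_time_t exptime, rel_time_t now_rel) {
    return exptime != 0 && exptime <= now_rel;
}
```

For example, with process_started at 1000000, a clock stepped back to 999900 yields a current_time of 4294967196, so an item that was due to expire at relative time 160 is treated as expired immediately.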
A couple of easy ideas off the top of my head that would trade some
accuracy for not depending on system timers (and any cross-platform timer
idiocy):
- Ditch "process_started" and start a counter at 0. Every second,
current_time would be incremented by 1. A relative timeout of "60 seconds
from now" would be set to "current_time + 60", as it is now. We'd have to
do something special for date-formatted expirations, potentially by
noting the exact time once at startup and using it to turn a provided
date into a delta-in-seconds. The latter can still be influenced by a bad
clock, but maybe not as noticeably, and the feature is less used.
- Add some sanity checks in the clock update function, which will fall
back to incrementing by 1 if it detects a significant forward clock
correction, or if the clock has gone back in time. It still uses
gettimeofday() unless something goes wrong, and keeps plodding forward,
less accurately, when something does.
- Use whatever anti-clock-skew magic libevent might use. Need to
research more options :P
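A rough sketch of the first idea, under stated assumptions: the timer callback only increments a counter, and the wall clock is read once at startup for date-style expirations. The 30-day cutoff follows memcached's protocol documentation, which treats larger expiration values as absolute Unix timestamps; everything else (function names, returning 1 for past or negative dates) is hypothetical.

```c
#include <time.h>

typedef unsigned int rel_time_t;

/* Per memcached's protocol docs: expirations above 30 days are
 * absolute Unix timestamps rather than offsets. */
#define REALTIME_MAXDELTA (60 * 60 * 24 * 30)

static time_t started_wallclock;    /* read once, only for absolute dates */
static rel_time_t current_time;     /* seconds of uptime, counted by ticks */

static void counter_init(time_t now) {
    started_wallclock = now;
    current_time = 0;
}

/* Once-per-second timer callback: never consults the clock. */
static void counter_tick(void) {
    current_time++;
}

/* Hypothetical helper: map a client-supplied exptime to counter units. */
static rel_time_t to_rel_time(time_t exptime) {
    if (exptime == 0)
        return 0;                      /* 0 = never expires */
    if (exptime < 0)
        return 1;                      /* assumption: treat as already expired */
    if (exptime > REALTIME_MAXDELTA) { /* absolute Unix date */
        /* Delta against the startup snapshot; a bad clock at startup
         * still skews this, but only for date-style expirations. */
        if (exptime <= started_wallclock)
            return 1;                  /* past date: expired once counter >= 1 */
        return (rel_time_t)(exptime - started_wallclock);
    }
    return current_time + (rel_time_t)exptime; /* "N seconds from now" */
}
```

Only to_rel_time() uses the wall clock, and only through the startup snapshot, so a clock that jumps after startup can't make relative timeouts fire early or late. The cost is that the counter drifts if the timer callback runs late.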
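A hypothetical sketch of the second idea (sanity checks in the clock update): instead of recomputing the delta against process_started, it tracks how far the wall clock moved since the previous tick and accepts that step only if it is small and non-negative. The MAX_FORWARD_JUMP threshold of 10 seconds is an arbitrary assumption; the email doesn't say what counts as a significant correction. A server would pass the tv_sec field from gettimeofday() as `now`.

```c
#include <time.h>

typedef unsigned int rel_time_t;

/* Hypothetical: largest forward step (seconds) trusted in one tick. */
#define MAX_FORWARD_JUMP 10

static time_t last_wallclock;   /* wall clock seen at the previous tick */
static rel_time_t current_time; /* seconds since startup */

static void clock_init(time_t now) {
    last_wallclock = now;
    current_time = 0;
}

/* Once-per-second update. */
static void clock_update(time_t now) {
    time_t step = now - last_wallclock;
    last_wallclock = now;   /* resync so one jump is only penalized once */
    if (step < 0 || step > MAX_FORWARD_JUMP)
        current_time += 1;  /* went back in time or leapt forward: plod on */
    else
        current_time += (rel_time_t)step;
}
```

Because it works from per-tick steps, this variant doesn't need process_started at all. The threshold is a trade-off: too small and a legitimately delayed timer (say, under heavy load) is flattened to a one-second step; too large and real clock jumps slip through.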
Anyone care? The growing number of these reports is getting obnoxious,
and cloud computing's god-awful-ness can only make it worse.
-Dormando