Hello all,

In adding support for the sFlow monitoring standard to the latest memcached (see http://code.google.com/p/memcached/issues/detail?id=202), I used a scheme for lock-free counter accumulation that might make sense for the memcached stats counters too.
Background: the memcached critical path includes a few calls to pthread_mutex_lock() that might not be strictly necessary. Some of these are per-thread mutex locks that will probably not see much contention, but even a user-space-only mutex lock still involves an atomic operation, and I think this can add up to rather a large number of clock cycles, right? Enough that it might be limiting the max throughput of a memcached cluster(?)

First, I apologize if this has all been thrashed out before!

Second, a question: is it safe to say that all platforms using memcached today are at least 32-bit natively? In other words, can we assume that an aligned 32-bit STORE or FETCH can be considered atomic (even though a ++ increment is emphatically NOT atomic)?

If so, how about a new scheme for maintaining stats counters:

1) Define aligned 32-bit thread-local variables for *all* the counters (even the global "server" ones), and bump only these in the critical path, with no mutex locking.

2) Every 100 ms or so, or when a worker thread terminates, accumulate these counters into their 64-bit equivalents so that they can be served in response to the GET-STATS command. Again, no locking on the per-thread counters: just a lock on the 64-bit counter structures.

I suggest 100 ms because it is enough to get this out of the critical path, avoids undetected 32-bit rollovers, and also avoids any significant re-sampling error, e.g. if someone is polling GET-STATS every 10 seconds they will only see a 1% error due to re-sampling (100 ms out of 10 s).

Thoughts?

Neil
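P.S. To make the proposal concrete, here is a rough sketch of the two steps. It is not taken from the memcached source or the sFlow patch: the names (worker_stats, global_stats, bump_get_cmds, accumulate_stats, NUM_WORKERS) are hypothetical, and only one counter is shown. It also assumes a C11 compiler and uses relaxed atomic loads and stores to express "aligned 32-bit STORE/FETCH is atomic" portably; on mainstream platforms these typically compile to plain loads and stores. The accumulation is assumed to run from a separate timer thread, which is one possible reading of step 2.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_WORKERS 4  /* hypothetical number of worker threads */

/* Per-worker counters: each one is written only by its owning thread. */
typedef struct {
    _Atomic uint32_t get_cmds;  /* naturally aligned 32-bit counter */
} worker_stats;

/* 64-bit totals served to clients; only these are protected by a lock. */
typedef struct {
    pthread_mutex_t lock;
    uint64_t get_cmds;                /* accumulated 64-bit total */
    uint32_t last_seen[NUM_WORKERS];  /* last 32-bit value read per worker */
} global_stats;

static worker_stats workers[NUM_WORKERS];
static global_stats totals = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Step 1: critical path. There is a single writer per counter, so a
 * relaxed load followed by a relaxed store is enough: no mutex and no
 * atomic read-modify-write instruction. */
static inline void bump_get_cmds(int w)
{
    assert(w >= 0 && w < NUM_WORKERS);
    uint32_t v = atomic_load_explicit(&workers[w].get_cmds,
                                      memory_order_relaxed);
    atomic_store_explicit(&workers[w].get_cmds, v + 1,
                          memory_order_relaxed);
}

/* Step 2: run every ~100 ms (and when a worker exits). Reads each
 * 32-bit counter without locking it and folds the change since the
 * previous pass into the 64-bit total. */
static void accumulate_stats(void)
{
    pthread_mutex_lock(&totals.lock);
    for (int w = 0; w < NUM_WORKERS; w++) {
        uint32_t now = atomic_load_explicit(&workers[w].get_cmds,
                                            memory_order_relaxed);
        /* Unsigned subtraction is modulo 2^32, so a single rollover
         * between two passes still yields the correct delta. */
        uint32_t delta = now - totals.last_seen[w];
        totals.get_cmds += delta;
        totals.last_seen[w] = now;
    }
    pthread_mutex_unlock(&totals.lock);
}

/* What a stats request would read: the 64-bit total, under the lock. */
static uint64_t read_total_get_cmds(void)
{
    pthread_mutex_lock(&totals.lock);
    uint64_t v = totals.get_cmds;
    pthread_mutex_unlock(&totals.lock);
    return v;
}
```

The worker never touches the lock; the only contention is between the accumulator and whoever is serving stats. The delta only goes wrong if a counter advances by 2^32 or more between two passes, which is what the short accumulation interval is meant to rule out.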
