I remember discussing this with Dustin back when I added the thread-local
counters. Back then I suggested that we just do dirty reads of the counters
used by the other threads, but we decided that we might as well do it the
"safe way" and fix it later if lock contention or the performance impact
ever became a problem.

I don't treat the stats as exact science, so I could live with a dirty read
and possibly some bogus values reported every now and then ;-) I still think
that the extra overhead added by this "safe and simple" implementation with
the locks is a small percentage of the clock cycles used to process a single
request (from the moment the packet enters the network adapter until we send
the response).

Do you have any measurements that indicate the benefit we could get from
changing the implementation, and the amount of code needed to implement it?
(I'm not saying I'm against it, I just don't think we should add a "complex"
solution to a problem we don't notice ;-) )

Cheers,

Trond


On 24 May 2011, at 21:31, Neil Mckee wrote:

> Hello all,
> 
> In adding support for the sFlow monitoring standard to the latest memcached  
> (see http://code.google.com/p/memcached/issues/detail?id=202) I used a scheme 
> for lock-free counter accumulation that might make sense for the memcached 
> stats counters too.
> 
> Background:  the memcached critical path includes a few calls to 
> pthread_mutex_lock()  that might not strictly be necessary.   Some of these 
> are per-thread mutex locks that will probably not see much contention,  but 
> even a user-space-only mutex lock still involves an atomic operation and I 
> think this can add up to rather a large number of clock cycles,  right?   
> Enough that it might be limiting the max throughput of a memcached cluster(?)
> 
> First,  I apologize if this has all been thrashed out before!
> 
> Second,  a question:  is it safe to say that all platforms running memcached 
> today are at least 32-bit natively?   In other words,  can we assume that an 
> aligned 32-bit STORE or FETCH is atomic?  (even though a ++ 
> increment is emphatically NOT atomic).
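> 
> A minimal sketch of the distinction, with hypothetical names:
> 
>     #include <stdint.h>
> 
>     volatile uint32_t counter;      /* aligned 32-bit word */
> 
>     /* NOT atomic: compiles to load / add-1 / store, so two threads can
>        read the same old value and one increment is silently lost. */
>     void bump(void) { counter++; }
> 
>     /* Atomic under the assumption above: an aligned 32-bit load is a
>        single word access, so a reader sees either the old value or the
>        new one, never a torn mix. */
>     uint32_t snapshot(void) { return counter; }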
> 
> If so,  how about a new scheme for maintaining stats counters:
> 
> 1) define aligned 32-bit thread-local variables for *all* the counters (even 
> the global "server" ones),  and bump only these in the critical path - 
> with no mutex locking.
> 2) every 100 ms or so,  or when a worker thread terminates, accumulate these 
> counters into their 64-bit equivalents so that they can be served in 
> response to the GET-STATS command (see the sketch below).  Again, no locking 
> on the per-thread counters:  just a lock on the 64-bit counter structures.
> 
> I suggest 100 ms because it is enough to get this out of the critical path,  
> avoids undetected 32-bit rollovers,  and also avoids any significant 
> re-sampling error: e.g. if someone is polling GET-STATS every 10 seconds 
> they will only see a 1% error due to re-sampling.
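> 
> A minimal sketch of (1) and (2), with hypothetical names (not the actual 
> memcached code).  Each worker bumps only its own slot with no locking; a 
> housekeeping thread wakes every ~100 ms (and once more when a worker exits) 
> and folds the delta into the 64-bit total - the unsigned modulo-2^32 
> subtraction makes a single rollover between samples harmless:
> 
>     #include <stdint.h>
>     #include <pthread.h>
> 
>     #define MAX_THREADS 64
> 
>     /* One aligned 32-bit counter per worker; only the owner writes it. */
>     static volatile uint32_t thread_gets[MAX_THREADS];
>     static uint32_t last_seen[MAX_THREADS];  /* accumulator's last view */
>     static uint64_t total_gets;              /* served by GET-STATS     */
>     static pthread_mutex_t totals_lock = PTHREAD_MUTEX_INITIALIZER;
> 
>     /* Critical path: a plain increment, no lock, no atomic RMW. */
>     void bump_gets(int my_id) { thread_gets[my_id]++; }
> 
>     /* Every ~100 ms: snapshot each counter (atomic aligned read) and
>        add the delta since the last pass to the 64-bit total. */
>     void accumulate(int nthreads) {
>         int i;
>         pthread_mutex_lock(&totals_lock);
>         for (i = 0; i < nthreads; i++) {
>             uint32_t now = thread_gets[i];
>             total_gets += (uint32_t)(now - last_seen[i]);
>             last_seen[i] = now;
>         }
>         pthread_mutex_unlock(&totals_lock);
>     }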
> 
> Thoughts?
> 
> Neil
> 
