I don't have measurements, sorry. I'm just working from the finger-in-the-air 
assumption that even uncontested mutex locks can cost a surprising number of 
cycles. (I think the TOP-KEYS mutex lock is much more serious, because the 
threads will actually stall on it - that's the kind of lock I was taking 
particular care to avoid in the sFlow implementation.)

If you think this *might* be worth pursuing, then I don't mind implementing it 
for testing, but a quick way to assess the potential gain would be to turn off 
the stats collection altogether and see how far the max operations/sec can be 
pushed (e.g. by #defining STATS_INCR1 to a no-op). I'm not confident that I 
have a setup that can push the system to its limits, though, so I don't know 
if I can offer good measurement data myself.
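
Something along these lines, for example - I'm assuming the increments all 
funnel through that macro, and the exact arguments would need checking against 
the current source, but the point is just to compile the stats increments (and 
their locking) out of the critical path for a benchmark run:

    /* hypothetical benchmark hack: the macro's arguments are a guess */
    #undef  STATS_INCR1
    #define STATS_INCR1(...)   /* expands to nothing: no lock, no bump */

If the ops/sec ceiling doesn't move with that in place, then the locking 
clearly isn't the bottleneck and we can forget the whole idea.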

Neil

P.S. There's no reason to expect bogus values. One of the counters in a 
get-stats response could be one or two transactions "ahead" of another, 
because it was incremented while the counters were being rolled together, but 
you shouldn't see any sharp discontinuity.


On May 24, 2011, at 1:00 PM, Trond Norbye wrote:

> I remember discussing this with Dustin back when I added the thread-local 
> counters. Back then I suggested that we just do dirty reads of the counters 
> used by the other threads, but we decided that we might as well do it the 
> "safe way" and fix it if it ever showed up as lock contention or a 
> performance impact. 
> 
> I don't treat the stats as an exact science, so I could live with a dirty 
> read and possibly some bogus values reported every now and then ;-) I still 
> think the extra overhead added by this "safe and simple" implementation with 
> the locks is a small percentage of the clock cycles used to process a single 
> request (from the moment the packet enters the network adapter until we send 
> the response). 
> 
> Do you have any kind of measurements that indicate the benefit we could get 
> from changing the implementation, and the amount of code needed to implement 
> it? (I'm not saying I'm against it, I just don't think we should add a 
> "complex" solution to something we don't notice ;-) )
> 
> Cheers,
> 
> Trond
> 
> 
> On May 24, 2011, at 9:31 PM, Neil Mckee wrote:
> 
>> Hello all,
>> 
>> In adding support for the sFlow monitoring standard to the latest memcached 
>> (see http://code.google.com/p/memcached/issues/detail?id=202), I used a 
>> scheme for lock-free counter accumulation that might make sense for the 
>> memcached stats counters too.
>> 
>> Background: the memcached critical path includes a few calls to 
>> pthread_mutex_lock() that might not strictly be necessary. Some of these 
>> are per-thread mutex locks that will probably not see much contention, but 
>> even a user-space-only mutex lock still involves an atomic operation, and I 
>> think this can add up to rather a large number of clock cycles, right? 
>> Enough that it might be limiting the max throughput of a memcached cluster(?)
>> 
>> First,  I apologize if this has all been thrashed out before!
>> 
>> Second, a question: is it safe to say that all platforms using memcached 
>> today are at least 32-bit natively? In other words, can we assume that an 
>> aligned 32-bit STORE or FETCH can be considered atomic (even though a ++ 
>> increment is emphatically NOT atomic)?
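>> 
>> To spell out the distinction I'm relying on, here is a throwaway sketch 
>> (the names are just illustrative):
>> 
>>     #include <stdint.h>
>> 
>>     static volatile uint32_t counter;  /* written by exactly ONE thread */
>> 
>>     static void bump(void)
>>     {
>>         uint32_t tmp = counter;   /* aligned 32-bit FETCH            */
>>         tmp = tmp + 1;            /* "++" is a read-modify-write,    */
>>         counter = tmp;            /* so it is NOT atomic by itself.  */
>>     }
>> 
>> With a single writer per counter the lost-update problem goes away, and all 
>> a reader in another thread has to rely on is that an aligned 32-bit FETCH 
>> never returns a half-written value.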
>> 
>> If so, how about a new scheme for maintaining the stats counters:
>> 
>> 1) define aligned 32-bit thread-local variables for *all* the counters (even 
>> the global "server" ones), and bump only these in the critical path - with 
>> no mutex locking.
>> 2) every 100 ms or so, or when a worker thread terminates, accumulate these 
>> counters into their 64-bit equivalents so that they can be served in 
>> response to the GET-STATS command. Again, no locking on the per-thread 
>> counters: just a lock on the 64-bit counter structures.
>> 
>> I suggest 100 ms because it is enough to keep this work out of the critical 
>> path, avoids undetected 32-bit rollovers, and also avoids any significant 
>> re-sampling error: e.g. if someone is polling GET-STATS every 10 seconds, 
>> they will only see a 1% error due to re-sampling.
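>> 
>> In code, the sort of thing I have in mind looks roughly like this. The names 
>> are just illustrative (not the real memcached identifiers), and the 
>> fixed-size array is only there to keep the sketch short:
>> 
>>     #include <pthread.h>
>>     #include <stdint.h>
>> 
>>     #define MAX_THREADS 64   /* fixed size just for the sketch */
>> 
>>     /* 1) one block of aligned 32-bit counters per worker thread.  Only
>>           the owning thread ever writes its block, and with no locking. */
>>     struct thread_counters {
>>         volatile uint32_t get_hits;
>>         volatile uint32_t get_misses;
>>         /* ... one field per stat, including the "server-wide" ones ... */
>>     };
>> 
>>     static struct thread_counters per_thread[MAX_THREADS];
>> 
>>     /* bumped in the critical path by worker 'me' - no mutex involved */
>>     #define STATS_HIT(me)  (per_thread[(me)].get_hits++)
>>     #define STATS_MISS(me) (per_thread[(me)].get_misses++)
>> 
>>     /* 2) 64-bit totals served to GET-STATS, behind a single lock that
>>           is only taken every ~100 ms and when GET-STATS is answered. */
>>     static pthread_mutex_t totals_lock = PTHREAD_MUTEX_INITIALIZER;
>>     static uint64_t total_get_hits;
>>     static uint64_t total_get_misses;
>> 
>>     /* per-thread values already folded in, so each roll-up only adds
>>        the delta (unsigned subtraction also survives a 32-bit wrap)   */
>>     static uint32_t seen_hits[MAX_THREADS];
>>     static uint32_t seen_misses[MAX_THREADS];
>> 
>>     /* run every ~100 ms, and when a worker terminates, by one thread */
>>     static void roll_up(int nthreads)
>>     {
>>         int i;
>>         pthread_mutex_lock(&totals_lock);
>>         for (i = 0; i < nthreads; i++) {
>>             /* dirty - but aligned, 32-bit - reads of worker counters */
>>             uint32_t hits   = per_thread[i].get_hits;
>>             uint32_t misses = per_thread[i].get_misses;
>>             total_get_hits   += (uint32_t)(hits   - seen_hits[i]);
>>             total_get_misses += (uint32_t)(misses - seen_misses[i]);
>>             seen_hits[i]   = hits;
>>             seen_misses[i] = misses;
>>         }
>>         pthread_mutex_unlock(&totals_lock);
>>     }
>> 
>> The "seen_*" arrays mean that a counter wrapping between roll-ups is still 
>> accounted for correctly, which is the other reason for keeping the interval 
>> down around 100 ms.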
>> 
>> Thoughts?
>> 
>> Neil
