On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson <[email protected]> wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson <[email protected]> 
>> wrote:
>> >
>> > Fwiw, here is a simple benchmark for an older read/write algorithm I 
>> > invented:
>> >
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> >
>> >
>> > It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
>> > macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the 
>> > code will test my algorithm. If not defined then it will test against 
>> > std::shared_mutex. I am wondering if anybody can compile and run the code? 
>> > Test it out on as many operating systems as you can. Can you please show 
>> > the output?
>> >
>> >
>> > Thanks everybody. :^)
>> >
>> > Here is the code:
>> >
>> > /* Simple, crude read/write mutex test
>> >    by: Chris M. Thomasson
>> > __________________________________________*/
>> [...]
>
>
>>
>>
>> Hey Chris!
>>
>> Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0
>>
>> std::shared_mutex
>> msec = 56118
>> msec = 49149
>> msec = 69416
>> msec = 68737
>> msec = 59174
>> msec = 65915
>>
>> ct_rwmutex
>> msec = 62772
>> msec = 71139
>> msec = 68110
>> msec = 66038
>> msec = 60436
>> msec = 74238
>
>
>
> Thank you so much! Okay, not great, but not all that "radically" 
> terrible when compared to std::shared_mutex on your end. Also, my work 
> can benefit from many sessions of padding and alignment therapy wrt 
> data-structure layout.
>
>
>>
>>
>> I can also test on a 2-CPU system with 88 hardware threads in total, but
>> the bench hardcodes 16 threads (there is std::thread::hardware_concurrency()
>> or something like that).
>
>
> Indeed! Will correct this issue. Make it dynamic. :^)
>
> Although, it can be beneficial to create more threads just to see how the 
> algorithm handles these cases of "oversubscribed" contention.

Agree. So std::thread::hardware_concurrency(),
2*std::thread::hardware_concurrency(),
4*std::thread::hardware_concurrency().
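A minimal sketch of sizing the thread count dynamically (helper names are illustrative, not from the benchmark; assumes only standard C++11 threads):

```cpp
#include <thread>
#include <vector>

// Candidate thread counts: 1x, 2x, and 4x the hardware concurrency,
// so the benchmark also covers the oversubscribed cases.
static std::vector<unsigned> thread_counts() {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 4; // the query is allowed to return 0; fall back
    return {hw, 2 * hw, 4 * hw};
}

// Spawn n threads running a (stand-in, empty) benchmark body, join them,
// and return the count actually run.
static unsigned run_threads(unsigned n) {
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([] { /* per-thread benchmark body goes here */ });
    for (auto &t : threads) t.join();
    return n;
}
```

A harness would then loop over thread_counts() and time run_threads(n) for each entry.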


>> I've found that it's critical to model realistic/target ratio between
>> reads/writes and amount of local work and work inside of critical
>> section in such benchmarks. Otherwise in 100% synthetic hammering
>> usually the worst mutex shows the best results. Synthetic benchmarks
>> produce extreme contention, so a mutex that blocks threads as much as
>> possible and allows as few threads as possible to work, shows the best
>> result because contention is minimized. The best would be to run all
>> threads sequentially one-by-one. But that's counter productive for
>> real life because blocked threads don't do useful work too.
>
>
> Agreed. Just wondering why it performs so much better in some high contention 
> scenarios. Load imbalance is a factor for sure.
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ

I don't need to tell you that performance of sync primitives can be
very non-linear and counter-intuitive :)

Perhaps the other mutexes were simply not dumb enough to come out ahead
under very high contention.


>> Also this benchmark has a fixed amount of work per thread, so I suspect
>> it may be subject to load imbalance when an unlucky outlier delays the
>> total result. A better option would be to run all threads for X
>> seconds, then set a global flag for all of them to stop, then
>> join all of them and measure the number of iterations.
>
>
> Indeed. A benchmark that just models a list that readers iterate and 
> writers push and pop from would do it. Then we can show how many operations, 
> reads or writes, they performed per second: the number of reads and 
> writes per second, per thread.
>
> This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
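The duration-based scheme could be sketched roughly like this (names are illustrative; each thread counts its own iterations until a shared stop flag flips):

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Run n worker threads for the given duration. Each thread bumps its own
// counter until the shared stop flag is set; the counters are summed after
// the join. (A real harness would pad each counter to a cache line to
// avoid false sharing.)
static unsigned long long run_for(unsigned n, std::chrono::milliseconds dur) {
    std::atomic<bool> stop{false};
    std::vector<unsigned long long> counts(n, 0);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([&counts, &stop, i] {
            while (!stop.load(std::memory_order_relaxed))
                ++counts[i]; // stand-in for one read or write operation
        });
    std::this_thread::sleep_for(dur);
    stop.store(true, std::memory_order_relaxed);
    for (auto &t : threads) t.join();
    unsigned long long total = 0;
    for (auto c : counts) total += c;
    return total;
}
```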
>
>
>>
>>
>> Also for me it lacked #include <climits>.
>
>
> Fixed that. Thanks again:
>
> https://pastebin.com/raw/xCBHY9qd
> (reload the page...)
>
>
> Fwiw, I am wondering what you think about the algorithm itself? Is it crap, 
> or decent? :^)


I think it's one of the best algorithms out there.
The complete fairness for both readers and writers is notable,
as is the wait-free fast-path for readers. It still suffers from high
cache line contention even on a read-only workload, but it's as good as
one can get with a centralized design (i.e. not per-thread/per-cpu
distributed designs, which have their own problems). I would expect a
CAS-loop-based read lock to degrade significantly under high read
load.
Btw your algorithm is used as the standard Go RWMutex for the past 7+ years:
https://github.com/golang/go/commit/daaf29cf932
(I should have mentioned your authorship somewhere!)

As for potential improvements, I can only think of optimizing the
uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading
writer competition resolution to a separate mutex, but rather
incorporating it into the same atomic variable readers and writers use
to synchronize with each other. Do you think it's possible? With a
CAS loop? Or maybe with some smart atomic_fetch_or? Wait,
atomic_fetch_or is compiled to a CAS loop on x86 anyway. These new
C/C++ atomic interfaces sometimes make me forget that.
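For illustration only (this is not the benchmarked algorithm): a toy single-word rwlock where the top bit marks a writer and the low bits count readers, so an uncontended writer lock is one CAS and unlock is one store. Note its CAS-loop read path has exactly the degradation under read load mentioned above:

```cpp
#include <atomic>
#include <thread>

// Toy spinning read/write lock on a single atomic word: the top bit marks
// a writer, the low bits count readers. Illustrative only; no fairness,
// no blocking.
class toy_rwlock {
    static constexpr unsigned WRITER = 1u << 31;
    std::atomic<unsigned> state_{0};
public:
    void lock() { // writer: a single CAS when uncontended
        unsigned expected = 0;
        while (!state_.compare_exchange_weak(expected, WRITER,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;
            std::this_thread::yield();
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }

    void lock_shared() { // reader: CAS loop, contended under read load
        for (;;) {
            unsigned s = state_.load(std::memory_order_relaxed);
            if (!(s & WRITER) &&
                state_.compare_exchange_weak(s, s + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return;
            std::this_thread::yield();
        }
    }
    void unlock_shared() { state_.fetch_sub(1, std::memory_order_release); }
};
```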

-- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.