On Wed, 16 Oct 2019 at 03:15, Yann Ylavic <ylavic....@gmail.com> wrote: > > Hi Ivan, > > On Tue, Oct 15, 2019 at 11:13 AM Ivan Zhakov <i...@apache.org> wrote: > > > > On Tue, 15 Oct 2019 at 01:16, Yann Ylavic <ylavic....@gmail.com> wrote: > > > > > > Then I don't know if we care enough about WIN32, but if so the > > > InterlockedCompareExchange64() call here is maybe a bit of a hammer > > > for usual apr_atomic_read64() usage. For things like lightly check for > > > the current 64bit shared value and do the heavy CAS work for some > > > special/unlikely condition only, this change is going to kill > > > performances. Since we provide _cas64() already, I fell that > > > read64/set64() should always be lightweight (when/if possible). > > > > > I considered this optimization, but decided not to implement it to > > avoid additional #ifdef. > > > > Do you have any evidence that CAS would be significantly slower than > > just read on x64? > > Consider a code like this: > > apr_uint64_t old; > do { > old = apr_atomic_read64(shared); > } while (apr_atomic_cas64(shared, old + 1, old) != old); > > which is a common lock-free pattern (here it's a re-implementation of > apr_atomic_inc64() with cas64, but it's just for the example). > > A "lock cmpxchg" for apr_atomic_read64() is going to "cost" one more > full barrier (i.e. serialization of outstanding loads and stores on > all cores) per loop, while the loop is intended to spin when the race > is lost. > The minimum garantee needed for read64() here is a single/atomic > memory access, which is clearly not the case on WIN32 according to > your results (see my comments below), but which should be the case on > 64bit CPUs Windows cares about (and if we care about arm64 too, we > should use ReadAcquire64() here to keep the same acquire/release > ordering garantees than on x86). > So, IMHO, the full barrier for apr_atomic_read64() is overkill on > 64bit CPUs with strong ordering garantees (like x86_64), and if one > wants that s/he can still use apr_atomic_cas64(mem, 0, 0) explicitely. > > I wrote the attached test and ran it on my linux x86_64 quad-core-CPU > (8 w/ hyperthreading) laptop. > The test is essentially starting T threads which run the above loop N > times to update C counters, and finally prints the average time and > number of (re)tries (per loop). > The implementation of apr_atomic_read64 (volatile or cas64), T, N and > C can be specified by the command line. > > Some results: > $ (T=10000000; N=8; C=4; \ > for i in {1..2}; do > for impl in cas64 volatile; do > echo "# $impl; T=$T; N=$N; C=$C"; > ./volatile_vs_cas64 $impl $T $N $C; > done; > done) > # cas64; T=10000000; N=8; C=4 > tries = 1.629 (per loop) > usecs = 0.486 (per loop) > # volatile; T=10000000; N=8; C=4 > tries = 1.713 (per loop) > usecs = 0.296 (per loop) > # cas64; T=10000000; N=8; C=4 > tries = 1.646 (per loop) > usecs = 0.527 (per loop) > # volatile; T=10000000; N=8; C=4 > tries = 1.722 (per loop) > usecs = 0.371 (per loop) > > > Multiple runs like this, including with other parameters, seem to > indicate that cas64 causes less retries but takes overall more time to > complete. > My interpretation is that the serialization in read64 gives the > following cas64 more chances to use an up-to-date old value, but it > also kills CPU caching. > On the other end, volatile access needs no cache consistency thus the > old value is less accurate, but spinning is cheaper (faster). > > This kind of micro-benching is of course to be taken with a pinch of > salt, and can't replace a real workload.. > > > > > But I can add special code for x64 if you think it's worth it, along the > > lines > > of the attached patch. > > I think it's worth it, though I have no real proof... Overall, that seems to be worthwhile. I've added optimization for apr_atomic_read64() on Windows x64 in r1868502. Thanks!
I didn't change apr_atomic_set64() since it was using InterlockedExchange() before my commit. > > > So possibly r1868129 is not necessary, or if > > > ReadAcquire64/WriteRelease64() can be used easily could we do that > > > instead? (having a look at the WIN32 assembly code for > > > apr_atomic_read64/set64() using ReadAcquire64/WriteRelease64() could > > > be instructive, though) > > > > > The r1868129 is necessary: there is test in r1868129 that fails on x86, > > without fix. > > OK, I was hoping that "volatile" could be special enough for the > compiler to issue a single load, but the limited number of 64bit > registers on x86 (the x87 one mainly) make this unrealistic for the > general case. So unless we use assembly language here (or C11), we > can't have this garantee. > > > > > While ReadAcquire64()/WriteRelease64() look promising, > > 1. They doesn't provide atomic read: test added in r1868129 fails if > > I replace InterlockedCompareExchange64() to ReadAcquire64() in > > apr_atomic_read64().. > > That's on WIN32 right? Or it also fails on WIN64?? > It fails in 32-bit Windows, but works on Windows x64. I don't have ARM64 to test. > > 2. They are undocumented. > > Indeed.. Since we don't support arm64 CPUs (AFAIK), your > "apr-optimize-read64-set64-v1" patch (which does the same) works for > me. I've changed condition to be _M_X64, i.e. only x86_x64 will be using direct memory read. -- Ivan Zhakov