On Wed, 16 Oct 2019 at 03:15, Yann Ylavic <ylavic....@gmail.com> wrote:
>
> Hi Ivan,
>
> On Tue, Oct 15, 2019 at 11:13 AM Ivan Zhakov <i...@apache.org> wrote:
> >
> > On Tue, 15 Oct 2019 at 01:16, Yann Ylavic <ylavic....@gmail.com> wrote:
> > >
> > > Then I don't know if we care enough about WIN32, but if so the
> > > InterlockedCompareExchange64() call here is maybe a bit of a hammer
> > > for typical apr_atomic_read64() usage. For code that lightly checks
> > > the current 64-bit shared value and does the heavy CAS work only for
> > > some special/unlikely condition, this change is going to kill
> > > performance. Since we provide _cas64() already, I feel that
> > > read64/set64() should always be lightweight (when/if possible).
> > >
> > I considered this optimization, but decided not to implement it to
> > avoid additional #ifdef.
> >
> > Do you have any evidence that CAS would be significantly slower than
> > just read on x64?
>
> Consider a code like this:
>
>     apr_uint64_t old;
>     do {
>         old = apr_atomic_read64(shared);
>     } while (apr_atomic_cas64(shared, old + 1, old) != old);
>
> which is a common lock-free pattern (here it's a re-implementation of
> apr_atomic_inc64() with cas64, but it's just for the example).
>
> A "lock cmpxchg" for apr_atomic_read64() is going to "cost" one more
> full barrier (i.e. serialization of outstanding loads and stores on
> all cores) per loop, while the loop is intended to spin when the race
> is lost.
> The minimum guarantee needed for read64() here is a single/atomic
> memory access, which is clearly not the case on WIN32 according to
> your results (see my comments below), but which should be the case on
> the 64-bit CPUs Windows cares about (and if we care about arm64 too, we
> should use ReadAcquire64() here to keep the same acquire/release
> ordering guarantees as on x86).
> So, IMHO, the full barrier for apr_atomic_read64() is overkill on
> 64-bit CPUs with strong ordering guarantees (like x86_64), and if one
> wants that behavior s/he can still use apr_atomic_cas64(mem, 0, 0)
> explicitly.
>
> I wrote the attached test and ran it on my linux x86_64 quad-core-CPU
> (8 w/ hyperthreading) laptop.
> The test essentially starts T threads which run the above loop N
> times to update C counters, and finally prints the average time and
> number of (re)tries (per loop).
> The implementation of apr_atomic_read64 (volatile or cas64), as well
> as T, N and C, can be specified on the command line.
>
> Some results:
> $ (T=10000000; N=8; C=4; \
>    for i in {1..2}; do
>       for impl in cas64 volatile; do
>          echo "# $impl; T=$T; N=$N; C=$C";
>          ./volatile_vs_cas64 $impl $T $N $C;
>       done;
>    done)
> # cas64; T=10000000; N=8; C=4
> tries = 1.629 (per loop)
> usecs = 0.486 (per loop)
> # volatile; T=10000000; N=8; C=4
> tries = 1.713 (per loop)
> usecs = 0.296 (per loop)
> # cas64; T=10000000; N=8; C=4
> tries = 1.646 (per loop)
> usecs = 0.527 (per loop)
> # volatile; T=10000000; N=8; C=4
> tries = 1.722 (per loop)
> usecs = 0.371 (per loop)
>
>
> Multiple runs like this, including with other parameters, seem to
> indicate that cas64 causes less retries but takes overall more time to
> complete.
> My interpretation is that the serialization in read64 gives the
> following cas64 more chances to use an up-to-date old value, but it
> also kills CPU caching.
> On the other hand, the volatile access forces no cache synchronization,
> so the old value is less accurate, but spinning is cheaper (faster).
>
> This kind of micro-benchmark is of course to be taken with a pinch of
> salt, and can't replace a real workload.
>
> >
> > But I can add special code for x64 if you think it's worth it, along
> > the lines of the attached patch.
>
> I think it's worth it, though I have no real proof...
Overall, that seems worthwhile. I've added an optimization for
apr_atomic_read64() on Windows x64 in r1868502. Thanks!

I didn't change apr_atomic_set64() since it was using
InterlockedExchange() before my commit.

> > > So possibly r1868129 is not necessary, or if
> > > ReadAcquire64/WriteRelease64() can be used easily could we do that
> > > instead? (having a look at the WIN32 assembly code for
> > > apr_atomic_read64/set64() using ReadAcquire64/WriteRelease64() could
> > > be instructive, though)
> > >
> > r1868129 is necessary: there is a test in r1868129 that fails on x86
> > without the fix.
>
> OK, I was hoping that "volatile" could be special enough for the
> compiler to issue a single load, but the limited number of 64-bit
> registers on x86 (mainly the x87 ones) makes this unrealistic for the
> general case. So unless we use assembly language here (or C11), we
> can't have this guarantee.
>
> >
> > While ReadAcquire64()/WriteRelease64() look promising,
> > 1. They don't provide an atomic read: the test added in r1868129 fails
> >    if I replace InterlockedCompareExchange64() with ReadAcquire64() in
> >    apr_atomic_read64().
>
> That's on WIN32, right? Or does it also fail on WIN64?
>
It fails on 32-bit Windows, but works on Windows x64. I don't have
ARM64 to test.

> > 2. They are undocumented.
>
> Indeed.. Since we don't support arm64 CPUs (AFAIK), your
> "apr-optimize-read64-set64-v1" patch (which does the same) works for
> me.
I've changed the condition to _M_X64, i.e. only x86_64 will use a
direct memory read.
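For readers following along, such a guard takes roughly this shape (a
hypothetical sketch only, not the code committed in r1868502):

```c
/* Hypothetical sketch, not the actual APR commit. On x86_64 an aligned
 * 64-bit load is atomic, so a direct read suffices; 32-bit x86 keeps
 * the CAS-based read. */
APR_DECLARE(apr_uint64_t) apr_atomic_read64(volatile apr_uint64_t *mem)
{
#if defined(_M_X64)
    return *mem;  /* single 8-byte load on x64 */
#else
    /* CAS with equal cmp/exchange values reads *mem without modifying
     * it, at the cost of a full barrier. */
    return InterlockedCompareExchange64((volatile LONG64 *)mem, 0, 0);
#endif
}
```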

-- 
Ivan Zhakov
