Hello Adam and all,
Please see below a reply from Andrej Podzimek commenting in much detail
on your observations of his RCU implementation.
M.D.
-------- Original Message --------
Subject: Re: Fwd: [HelenOS-devel] RCU algorithm review
Date: Sun, 17 Jun 2012 01:08:27 +0200
From: Andrej Podzimek <[email protected]>
To: Martin Decky <[email protected]>
Race condition?
---------------
Unless I am missing something, the algorithm appears
to suffer from a race condition between the
detector and reclaimer threads.
Detector's loop is:
    sleep
    ++cur_gp
    signal reclaimers a GP ended
    wait for preexisting readers
Instead of:
    sleep
    ++cur_gp
    wait for preexisting readers
    signal reclaimers a GP ended
As a result, in the following sequence, callback B()
is invoked before the preexisting reader R exits:
(reclaimer)                 (detector)          (reader)
CPU 1                       CPU 2               CPU 3
next_cbs += A
signal detector
wait for detector
                            (sleeps)
                            ++cur_gp
                            signal reclaimer
                            wait for readers
                                                R lock()
(different thread)
next_cbs += B
(reclaimer wakes up)
exec cur_cbs
move next_cbs to cur_cbs
signal detector
                            ++cur_gp
                            signal reclaimer
^^^ The second "signal reclaimer" call is a wrong assumption; it cannot
happen here. The detector will *not* announce a GP boundary while CPU 3
is still inside a read-side critical section that had started *before*
the reclaimer on CPU 1 signalled the detector. Let's see when the
detector can actually announce the GP boundary:
A) CPU 3 would have to announce that it had already observed the *last*
++cur_gp (by assigning the new value to its local copy of cur_gp). But
this has *not* happened yet -- there has been no R unlock(), and
read-side sections in interrupt handlers obey the nesting counter (just
like any nested R (un)lock() calls in general). So CPU 3 will have a
stale copy of the GP counter and the detector will have to wait for it.
(A sketch of this reader-side bookkeeping follows the trace below.)
B) Of course, the interleaving of instructions is not fixed ... so let's
assume, just for fun, that R lock() occurred *after* the last ++cur_gp,
as observed by some other thread. Then R lock() must have set CPU 3's
local copy of cur_gp to the last value and the detector will not wait
for CPU 3. But again, this is not a problem: CPU 3 had then already
observed the changes that resulted in issuing callback B *before*
starting the read-side section (and called the conditional barrier in R
lock()). It is therefore perfectly fine to execute callback B in
parallel with CPU 3's read-side section.
C) The detector will simply wait until CPU 3 either announces the
counter observation at R unlock() or context-switches or gets idle or
switches to userspace... If the detector waits for too long (more than a
small N of lbolt ticks) or gets a high water mark announcement (from
reclaimers), then it will try to forcibly context-switch itself to CPU 3
... which will *not* succeed before the non-preemptible read-side
section is over.
exec cur_cbs (including B!!)                    R still locked!!
                            wait for readers
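To make points A) and B) concrete, here is a minimal sketch of the
reader-side bookkeeping as described above. All the names (rcu_cpu_t,
last_seen_gp, nesting) and the choice of membar_enter() are my
illustrative assumptions, not the identifiers actually used in the
implementation:

#include <sys/types.h>
#include <sys/atomic.h>                 /* membar_enter() on illumos */

typedef struct rcu_cpu {
    volatile uint64_t last_seen_gp;     /* local copy of cur_gp */
    volatile uint_t nesting;            /* read-side nesting depth */
} rcu_cpu_t;

extern volatile uint64_t cur_gp;        /* global GP counter (detector) */

static inline void
rcu_read_lock(rcu_cpu_t *rc)
{
    /* Only the outermost lock announces; nested sections (including
     * those in interrupt handlers) just bump the nesting counter. */
    if (rc->nesting++ == 0 && rc->last_seen_gp != cur_gp) {
        membar_enter();                 /* conditional barrier */
        rc->last_seen_gp = cur_gp;      /* announce the observation */
    }
}

static inline void
rcu_read_unlock(rcu_cpu_t *rc)
{
    if (--rc->nesting == 0)
        rc->last_seen_gp = cur_gp;      /* announce at unlock, too */
}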
To sum up, maybe I am missing something, but I really cannot see a race
condition here.
If there was a race condition of this magnitude, the hash table
benchmarks (combined with thorough tests using bit patterns) would
probably crash after milliseconds. However, they have neither crashed
nor observed any inconsistent data modifications so far. (Admittedly,
they have only run for about two weeks on all the machines combined so
far...)
Anyway, it is *great* that someone spends time and effort on exploring
the algorithm and searching for possible shortcomings. I found a bug in
the hash table implementation approximately six months ago. The bug
only occurred under the most aggressive experimental optimization mode
of the Intel compiler. A single volatile cast fixed it. Yet one year
earlier, I had been convinced that no inconsistency of that kind could
ever be observed... (Notes about this interesting issue can still be
found in the comments.) So again, code reviews are really important --
many thanks for this one.
+ low overhead reader_lock/unlock (1 MB per GP and cpu)
The overhead might be higher than that, unfortunately, although I'm not
sure about this. The MB should occur only when the global GP counter
changes ... theoretically. But in practice, the CPU may issue the MB
speculatively, just to make sure it can safely take the alternative
branch at any time. It is hard to guess whether this actually happens.
Linux uses prediction hints (defined as macros wrapping the if
statements) that are reflected in the machine code on platforms that
support them; e.g., SPARC has "probably taken" / "probably not taken"
bits that can be set in each branch instruction. Perhaps this could
help. In that case the fast path could be marked as "probably taken",
in an attempt to prevent the CPU from issuing the MB speculatively.
But this is just a wild guess. Someone would have to define those
special if statements (so far UTS only uses standard C if, with no
macro replacements thereof) and *benchmark* it. AFAIK, GCC supports
such hints directly (and GCC is now the default compiler for Illumos).
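For illustration, GCC's __builtin_expect can express exactly this kind
of hint; Linux wraps it in likely()/unlikely() macros. A sketch of what
the reader's fast path could look like, reusing the assumed names from
the earlier sketch (none of this exists in UTS today):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static inline void
rcu_observe_gp(rcu_cpu_t *rc)
{
    if (likely(rc->last_seen_gp == cur_gp))
        return;                 /* fast path, marked probably-taken */
    membar_enter();             /* slow path: barrier + announcement */
    rc->last_seen_gp = cur_gp;
}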
- GP latency depends on cache coherency
There is a reasonable upper bound on GP duration as long as all
read-side sections behave correctly (and they should -- we are in the
f...ing kernel). So the mere fact that the average / expected / whatever
latency is influenced by UMA/NUMA and/or cache coherency protocols is
not that disastrous.
In some cases, longer GP durations save a lot of context-switching
effort by batched reclamation processing.
- forces context switches of long running threads
without RCU readers
True. But this does not occur too often (from a machine's perspective).
It would be nice if some threads could be explicitly marked as "free of
RCU readers". The idle thread is (implicitly) marked that way right now
and so is all userspace processing. :-)
Unfortunately, an attempt to set a "free of RCU readers" flag on each R
unlock() would (as one can imagine) lead to ugly race conditions, so all
ideas of this kind need to be considered very carefully.
Long-running kernel threads would (of course) be interrupted by the
detector, but *most* long-running threads run in userspace (and so the
detector does not need to take them into account and interrupt their
CPUs at all). Long-running threads performing computations are *rare* in
the kernel. Some of them are related to ZFS, but they use the SYSDC
scheduling class, which means that they can be preempted by any other
threads (including userspace threads) quite often.
- reclaimers and detector serialized via a single mutex
A mutex *and* a condition variable, which is slightly more scalable than
a mutex alone. Furthermore, reclaimers process potentially long batches
of callbacks with the mutex unlocked and the detector also keeps it
unlocked most of the time. The detector prolongs grace period duration
on purpose (when there is no high water announcement) by a small N of
lbolt ticks, which reduces the frequency of competition for the mutex
even further.
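For concreteness, a sketch of that handshake with illumos-style
primitives; the variable names and the exact waiting policy are my
assumptions, not the actual code:

#include <sys/mutex.h>
#include <sys/condvar.h>
#include <sys/sunddi.h>                 /* ddi_get_lbolt() */

static kmutex_t rcu_mtx;
static kcondvar_t rcu_cv;
static volatile uint64_t cur_gp;
static boolean_t high_water;            /* set by pressured reclaimers */

static void
detector_gp_boundary(clock_t delay_ticks)
{
    mutex_enter(&rcu_mtx);
    if (!high_water) {
        /* No pressure: prolong the GP by a small N of lbolt ticks. */
        (void) cv_timedwait(&rcu_cv, &rcu_mtx,
            ddi_get_lbolt() + delay_ticks);
    }
    cur_gp++;
    high_water = B_FALSE;
    cv_broadcast(&rcu_cv);              /* wake the waiting reclaimers */
    mutex_exit(&rcu_mtx);
    /* ... then wait for preexisting readers, mutex unlocked ... */
}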
However, it is definitely true that a better solution *must* be found
for large NUMA systems. It would be nice to have one detector per NUMA
node and a tree-based GP announcement protocol (skipping inactive nodes,
of course, and supporting multiple instances of the mechanism, thus
keeping GP detection as local as possible and not competing with
unrelated threads from distant NUMA nodes).
wait_on_gp - GP num whose end we are waiting for to
dispatch callbacks in cur_cbs (but it is not used
that way; see discussion of race condition)
It's important to understand this basic idea correctly. A callback does
*not* wait for the "end" of a particular GP. It waits for (at least)
*two* GP boundaries to elapse. This is why reclaimers have two queues of
callbacks -- one of them can be (and actually *is*) extended in parallel
with the reclaimer's work, while the other one is waiting for one more
GP boundary to occur and is not extended any longer. On each GP
boundary, the queue with old callbacks is processed and swapped with the
new queue atomically. Threads external to the RCU mechanism always wait
for two GP boundaries to occur -- they are (in general) not expected to
have the capability to split their work into smaller per-GP-boundary
batches the way reclaimers do.
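A sketch of the two-queue scheme on the reclaimer side, again with
illustrative names and the locking omitted for brevity:

typedef struct rcu_cb {
    struct rcu_cb *next;
    void (*func)(void *);
    void *arg;
} rcu_cb_t;

static rcu_cb_t *cur_cbs;       /* closed; waits for one more boundary */
static rcu_cb_t *next_cbs;      /* open; still accepting callbacks */

static void
reclaimer_on_gp_boundary(void)
{
    rcu_cb_t *batch = cur_cbs;

    /* Swap the queues (atomically, under the mutex, in the real code).
     * Every callback thus sees two boundaries: the first moves it from
     * next_cbs to cur_cbs, the second makes it runnable. */
    cur_cbs = next_cbs;
    next_cbs = NULL;

    /* Process the old batch with the mutex unlocked. */
    while (batch != NULL) {
        rcu_cb_t *cb = batch;
        batch = cb->next;
        cb->func(cb->arg);      /* may free cb itself */
    }
}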
The algorithm relies on cpu's cache-coherency protocol
to deliver (make visible) the updated cur_gp to readers.
The same applies in the other direction when loading
last_seen_gp in the detector.
It's just a heuristic and no more than that.
If the coherency protocol doesn't make the updated cur_gp visible to
readers, they will not announce a GP observation and potentially prevent
the detector from taking the fast path. If the coherency protocol
doesn't make the updated copies of cur_gp visible to the detector, the
detector will (again) be unable to take the fast path. However, this
does not impact the correctness of the algorithm.
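The fast-path check that the heuristic enables could look like the
following (names assumed, as in the reader sketch above); when it
fails, the detector simply falls back to the slow path, so correctness
never depends on it:

static boolean_t
all_cpus_observed_gp(rcu_cpu_t cpus[], int ncpus, uint64_t gp)
{
    int i;

    for (i = 0; i < ncpus; i++) {
        if (cpus[i].last_seen_gp != gp)
            return (B_FALSE);   /* someone is behind: slow path */
    }
    return (B_TRUE);            /* fast path: no forced switches */
}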
Stressing the RCU mechanism to the extreme on a small NUMA system
showed that the cache coherency protocol works fine and that the
heuristic helps a *lot*. The number of context switches per GP caused by the detector
thread seemed to have an exponential-like distribution with an expected
value around 1 on a 24-CPU system. But admittedly, this was related to
the frequency of context switches in our benchmarking workload (which is
rather high on purpose). Furthermore, it would be nice to measure this
*precisely*, rather than just guessing from context switch counters (and
from a crippled detector that starts on the same CPU all the time
instead of rotating over all the CPUs evenly).
If I am not mistaken, the implementation [6] checks
for threads being idle or in user space with
CPU->cpu_mstate in OpenSolaris. However, OpenSolaris
does not surround changes of cpu_mstate with MBs (for
performance reasons [7]); therefore, reading its value
from the detector is racy (e.g., if a CS immediately
follows the change).
It might seem prone to races with one CS immediately following an idle /
userspace state. However, this is why GP boundaries are announced rather
than GP "endings". The race condition you mention can never occur twice
in a row, and RCU callbacks (as well as external threads asking for GP
detection) are always delayed until at least *two* GP boundaries have
been observed. (See the sketch at the end of this message.)
^^^ If I am mistaken here, an example of a pathological scenario would
be very helpful. The experience with the non-blocking hash table taught
me one important fact: Intuition is a good servant, but a poor master.
:-) So none of my explanations can be taken for granted -- there can
still be bugs in both my thoughts and the algorithm.
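To illustrate the compensated race: the unsynchronized cpu_mstate read
below may be stale once, but since a callback enqueued now only runs
after *two* GP boundaries, a single mistaken "quiescent" verdict cannot
span the same read-side section twice. (The helper is my sketch; only
cpu_mstate and the CMS_* constants come from OpenSolaris.)

#include <sys/cpuvar.h>         /* cpu_t, CMS_IDLE, CMS_USER */

static boolean_t
cpu_looks_quiescent(cpu_t *cp)
{
    /* Unsynchronized read: cpu_mstate updates carry no MBs, so this
     * may race with a CS that immediately follows an idle/userspace
     * state. The two-boundary delay absorbs one stale observation. */
    return (cp->cpu_mstate == CMS_IDLE || cp->cpu_mstate == CMS_USER);
}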
_______________________________________________
HelenOS-devel mailing list
[email protected]
http://lists.modry.cz/cgi-bin/listinfo/helenos-devel