Hello Adam and all,

Please see below a reply from Andrej Podzimek, commenting in detail on your observations about his RCU implementation.


M.D.

-------- Original Message --------
Subject: Re: Fwd: [HelenOS-devel] RCU algorithm review
Date: Sun, 17 Jun 2012 01:08:27 +0200
From: Andrej Podzimek <[email protected]>
To: Martin Decky <[email protected]>

Race condition?
---------------
Unless I am missing something, the algorithm appears
to suffer from a race condition between the
detector and reclaimer threads.

Detector's loop is:
     sleep
     ++cur_gp
     signal reclaimers a GP ended
     wait for preexisting readers

Instead of:
     sleep
     ++cur_gp
     wait for preexisting readers
     signal reclaimers a GP ended

As a result, in the following sequence, callback B()
is invoked before the preexisting reader R exits:
     (reclaimer)     (detector)      (reader)
     CPU 1           CPU 2           CPU 3
     next_cbs += A
     signal detector
     wait for detector
     (sleeps)
                     ++cur_gp
                     signal reclaimer
                     wait for readers
                                     R lock()
     (different thread)
     next_cbs += B
     (reclaimer wakes up)
     exec cur_cbs
     move next_cbs to cur_cbs
     signal detector
                     ++cur_gp
                     signal reclaimer

^^^ The second "signal reclaimer" call is a wrong assumption; it cannot happen here. The detector will *not* announce a GP boundary while CPU 3 is still inside a read-side critical section that had started *before* the reclaimer on CPU 1 signalled the detector. Let's see when the detector can actually announce the GP boundary:

A) CPU 3 would have to announce that it had already observed the *last* ++cur_gp (by assigning the new value to its local copy of cur_gp). But this has *not* happened yet -- there has been no R unlock(), and read-side sections in interrupt handlers obey the nesting counter (just like any nested R (un)lock() calls in general). So CPU 3 will have a stale copy of the GP counter and the detector will have to wait for it.

B) Of course, the interleaving of instructions is not fixed ... so let's assume, just for fun, that R lock() occurred *after* the last ++cur_gp, as observed by some other thread. Then R lock() must have set CPU 3's local copy of cur_gp to the last value and the detector will not wait for CPU 3. But again, this is not a problem: CPU 3 had then already observed the changes that resulted in issuing callback B *before* starting the read-side section (and called the conditional barrier in R lock()). It is therefore perfectly fine to execute callback B in parallel with CPU 3's read-side section.

C) The detector will simply wait until CPU 3 either announces the counter observation at R unlock(), context-switches, goes idle, or switches to userspace... If the detector waits for too long (more than a small N of lbolt ticks) or gets a high water mark announcement (from reclaimers), it will try to forcibly context-switch itself to CPU 3 ... which will *not* succeed before the non-preemptible read-side section is over.

     exec cur_cbs (including B!!)    R still locked!!
                     wait for readers

To sum up, maybe I am missing something, but I really cannot see a race condition here.
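
For illustration, here is a minimal user-space sketch of the wait described in A) through C). All names (cur_gp, last_seen_gp, NCPUS) are hypothetical stand-ins for the real per-CPU structures, and sched_yield() replaces the detector's forced cross-CPU context switch:

     #include <sched.h>
     #include <stdatomic.h>

     #define NCPUS 4

     static atomic_ulong cur_gp;               /* global GP counter */
     static atomic_ulong last_seen_gp[NCPUS];  /* per-CPU copies of cur_gp */

     static void wait_for_preexisting_readers(void)
     {
         unsigned long gp = atomic_load(&cur_gp);

         for (int i = 0; i < NCPUS; i++) {
             /* Case A: a CPU with a stale copy keeps us waiting until
              * R unlock() (or a context switch, idle, or userspace
              * transition) announces the counter.  Case B: a CPU whose
              * R lock() already observed ++cur_gp is skipped, which is
              * safe because it also observed the pending updates. */
             while (atomic_load(&last_seen_gp[i]) < gp)
                 sched_yield();  /* stand-in for case C's forced
                                  * cross-CPU context switch */
         }
     }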

If there were a race condition of this magnitude, the hash table benchmarks (combined with thorough tests using bit patterns) would probably crash within milliseconds. However, they have neither crashed nor detected any inconsistent data modifications so far. (Admittedly, they have only run for about two weeks on all the machines combined...)

Anyway, it is *great* that someone spends time and effort on exploring the algorithm and searching for possible shortcomings. I found a bug in the hash table implementation approximately six months ago. The bug only occurred under the most aggressive experimental optimization mode of the Intel compiler, and a single volatile cast fixed it. Yet one year earlier, I had been convinced that no inconsistency of that kind could ever be observed... (Notes about this interesting issue can still be found in the comments.) So again, code reviews are really important, and many thanks for this one.

+ low overhead reader_lock/unlock (1 MB per GP and CPU)

The overhead might be higher than that, unfortunately, although I am not sure about this. The MB should occur only when the global GP counter changes ... theoretically. But in practice, the CPU may issue the MB speculatively, just to make sure it can safely take the alternative branch at any time. It is hard to guess whether this actually happens.
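
To make the fast path concrete, here is a rough user-space sketch of a reader lock with the conditional barrier; _Thread_local stands in for per-CPU data and all names are hypothetical:

     #include <stdatomic.h>

     static atomic_ulong cur_gp;                /* global GP counter */
     static _Thread_local unsigned long my_gp;  /* this CPU's copy */
     static _Thread_local unsigned nesting;     /* read-side nesting */

     static void reader_lock(void)
     {
         nesting++;
         if (my_gp != atomic_load_explicit(&cur_gp, memory_order_relaxed)) {
             /* Slow path, taken at most once per GP and CPU: announce
              * the newly observed counter and issue the one MB. */
             my_gp = atomic_load_explicit(&cur_gp, memory_order_relaxed);
             atomic_thread_fence(memory_order_seq_cst);
         }
         /* Fast path: no barrier in the code -- though, as noted above,
          * the CPU might still prepare for the other branch speculatively. */
     }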

Linux uses prediction hints (the likely()/unlikely() macros wrapped around the conditions of if statements) that are reflected in the machine code on platforms that support them; e.g., SPARC has "probably taken" / "probably not taken" bits that can be set in each branch instruction. Perhaps this could help. In that case the fast path could be marked as "probably taken", in an attempt to prevent the CPU from issuing the MB speculatively.

But this is just a wild guess. Someone would have to define those special if statements (so far UTS only uses standard C if, with no macro replacements thereof) and *benchmark* it. GCC supports such hints directly via __builtin_expect (and GCC is now the default compiler for Illumos).
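
For example, with GCC the hints could look like this (Linux spells them likely() and unlikely(); nothing of the sort exists in UTS yet, so this is purely illustrative):

     #include <stdatomic.h>

     #define likely(x)   __builtin_expect(!!(x), 1)
     #define unlikely(x) __builtin_expect(!!(x), 0)

     static atomic_ulong cur_gp;
     static _Thread_local unsigned long my_gp;

     static void reader_lock_hinted(void)
     {
         if (unlikely(my_gp !=
                      atomic_load_explicit(&cur_gp, memory_order_relaxed))) {
             /* Marked "probably not taken": the compiler lays out the
              * fast path first and, on SPARC, could set the prediction
              * bits mentioned above. */
             my_gp = atomic_load_explicit(&cur_gp, memory_order_relaxed);
             atomic_thread_fence(memory_order_seq_cst);
         }
         /* Whether this stops the speculative MB would have to be
          * benchmarked, as stated above. */
     }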

- GP latency depends on cache coherency

There is a reasonable upper bound on GP duration as long as all read-side sections behave correctly (and they should -- we are in the f...ing kernel). So the mere fact that the average / expected / whatever latency is influenced by UMA/NUMA and/or cache coherency protocols is not that disastrous.

In some cases, longer GP durations save a lot of context-switching effort by batched reclamation processing.

- forces context switches of long running threads
   without RCU readers

True. But this does not occur very often (from the machine's perspective).

It would be nice if some threads could be explicitly marked as "free of RCU readers". The idle thread is (implicitly) marked that way right now and so is all userspace processing. :-)

Unfortunately, an attempt to set a "free of RCU readers" flag on each R unlock() would (as one can imagine) lead to ugly race conditions, so all ideas of this kind need to be considered very carefully.

Long-running kernel threads would (of course) be interrupted by the detector, but *most* long-running threads run in userspace (and so the detector does not need to take them into account and interrupt their CPUs at all). Long-running threads performing computations are *rare* in the kernel. Some of them are related to ZFS, but they use the SYSDC scheduling class, which means that they can be preempted by any other threads (including userspace threads) quite often.

- reclaimers and detector serialized via a single mutex

A mutex *and* a condition variable, which is slightly more scalable than a mutex alone. Furthermore, reclaimers process potentially long batches of callbacks with the mutex unlocked and the detector also keeps it unlocked most of the time. The detector prolongs grace period duration on purpose (when there is no high water announcement) by a small N of lbolt ticks, which reduces the frequency of competition for the mutex even further.
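
Roughly, the handshake looks like this (a user-space sketch with pthreads standing in for kmutex/kcondvar; all names are hypothetical):

     #include <errno.h>
     #include <pthread.h>
     #include <stdbool.h>
     #include <time.h>

     static pthread_mutex_t rcu_lock = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  rcu_cv   = PTHREAD_COND_INITIALIZER;
     static bool high_water;  /* set by reclaimers when queues grow long */

     /* Reclaimer side: announce new callbacks, possibly urgently. */
     static void reclaimer_announce(bool urgent)
     {
         pthread_mutex_lock(&rcu_lock);
         if (urgent)
             high_water = true;
         pthread_cond_signal(&rcu_cv);
         pthread_mutex_unlock(&rcu_lock);
     }

     /* Detector side: prolong the GP by a few ticks unless a high water
      * announcement arrives, then drop the mutex for the long wait. */
     static void detector_pause(const struct timespec *deadline)
     {
         pthread_mutex_lock(&rcu_lock);
         while (!high_water &&
                pthread_cond_timedwait(&rcu_cv, &rcu_lock, deadline) !=
                    ETIMEDOUT)
             ;
         high_water = false;
         pthread_mutex_unlock(&rcu_lock);
         /* ... wait for preexisting readers with the mutex unlocked ... */
     }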

However, it is definitely true that a better solution *must* be found for large NUMA systems. It would be nice to have one detector per NUMA node and a tree-based GP announcement protocol, skipping inactive nodes (of course!) and supporting multiple instances of the mechanism, thus keeping GP detection as local as possible and not competing with unrelated threads on distant NUMA nodes.

wait_on_gp - the GP number whose end we wait for
    before dispatching callbacks in cur_cbs (but it is
    not used that way; see the discussion of the race
    condition)

It's important to understand this basic idea correctly. A callback does *not* wait for the "end" of a particular GP. It waits for (at least) *two* GP boundaries to elapse. This is why reclaimers have two queues of callbacks -- one of them can be (and actually *is*) extended in parallel with the reclaimer's work, while the other one is waiting for one more GP boundary to occur and is not extended any longer. On each GP boundary, the queue with old callbacks is processed and swapped with the new queue atomically. Threads external to the RCU mechanism always wait for two GP boundaries to occur -- they are (in general) not expected to have the capability to split their work into smaller per-GP-boundary batches the way reclaimers do.
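
In code, the two-queue idea boils down to something like this (a sketch only; names hypothetical):

     #include <stddef.h>

     typedef struct rcu_cb {
         struct rcu_cb *next;
         void (*func)(void *);
         void *arg;
     } rcu_cb_t;

     static rcu_cb_t *next_cbs;  /* still extended by new callbacks */
     static rcu_cb_t *cur_cbs;   /* closed batch; waits one more boundary */

     /* Runs on each GP boundary, with the queues protected by the
      * reclaimer's mutex: execute the old batch, then swap the queues. */
     static void on_gp_boundary(void)
     {
         rcu_cb_t *cb = cur_cbs;

         while (cb != NULL) {
             rcu_cb_t *next = cb->next;
             cb->func(cb->arg);  /* enqueued two boundaries ago -- safe */
             cb = next;
         }
         cur_cbs = next_cbs;     /* the new batch waits one more boundary */
         next_cbs = NULL;
     }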

The algorithm relies on the CPU's cache-coherency
protocol to deliver (make visible) the updated cur_gp
to readers. The same applies in the other direction
when loading last_seen_gp in the detector.

It's just a heuristic and no more than that.

If the coherency protocol does not make the updated cur_gp visible to readers, they will not announce a GP observation and may prevent the detector from taking the fast path. If the coherency protocol does not make the updated copies of cur_gp visible to the detector, the detector will (again) be unable to take the fast path. However, this does not impact the correctness of the algorithm.
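
In other words, the heuristic only gates the fast path, which might look roughly like this (same hypothetical names as in the sketches above):

     #include <stdatomic.h>
     #include <stdbool.h>

     #define NCPUS 4

     static atomic_ulong cur_gp;
     static atomic_ulong last_seen_gp[NCPUS];

     /* Fast path check: if the coherency protocol has propagated cur_gp
      * and every CPU has announced it, nobody needs to be interrupted.
      * A stale copy merely returns false -- correctness is unaffected. */
     static bool all_cpus_announced(void)
     {
         unsigned long gp = atomic_load(&cur_gp);

         for (int i = 0; i < NCPUS; i++)
             if (atomic_load(&last_seen_gp[i]) < gp)
                 return false;
         return true;
     }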

Stressing the RCU mechanism extremely on a small NUMA system showed that the cache coherency protocol works fine and that the heuristic helps a *lot*. The number of context switches per GP caused by the detector thread seemed to have an exponential-like distribution with an expected value around 1 on a 24-CPU system. But admittedly, this was related to the frequency of context switches in our benchmarking workload (which is rather high on purpose). Furthermore, it would be nice to measure this *precisely*, rather than just guessing from context switch counters (and from a crippled detector that starts on the same CPU all the time instead of rotating over all the CPUs evenly).

If I am not mistaken, the implementation [6] checks
for threads being idle or in user space with
CPU->cpu_mstate in OpenSolaris. However, OpenSolaris
does not surround changes of cpu_mstate with MBs (for
performance reasons [7]); therefore, reading its value
from the detector is racy (e.g., if a CS immediately
follows the change).

Reading cpu_mstate might indeed seem prone to races, with one CS immediately following an idle / userspace state. However, this is exactly why GP boundaries are announced rather than GP "endings". The race condition you mention will never occur twice in a sequence, yet RCU callbacks (as well as external threads asking for GP detection) are always delayed until at least *two* GP boundaries have been observed.

^^^ If I am mistaken here, an example of a pathological scenario would be very helpful. The experience with the non-blocking hash table taught me one important fact: intuition is a good servant, but a poor master. :-) So none of my explanations can be taken for granted -- there can still be bugs in both my thoughts and the algorithm.
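
To make the two-boundary rule concrete, an external thread's wait might be sketched like this (user-space pthreads stand-in, hypothetical names):

     #include <pthread.h>

     static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  gp_cv   = PTHREAD_COND_INITIALIZER;
     static unsigned long gp_boundaries;  /* bumped on each GP boundary */

     /* External threads sleep through *two* boundaries; since the racy
      * cpu_mstate observation cannot occur twice in a sequence, at least
      * one of the two boundaries is detected soundly. */
     static void wait_for_grace_period(void)
     {
         pthread_mutex_lock(&gp_lock);
         unsigned long start = gp_boundaries;
         while (gp_boundaries < start + 2)
             pthread_cond_wait(&gp_cv, &gp_lock);
         pthread_mutex_unlock(&gp_lock);
     }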

_______________________________________________
HelenOS-devel mailing list
[email protected]
http://lists.modry.cz/cgi-bin/listinfo/helenos-devel
