Hello Adam and all,

Please see below a reply from Andrej Podzimek, commenting in detail on your observations about his RCU implementation.


M.D.

-------- Original Message --------
Subject: Re: Fwd: [HelenOS-devel] RCU algorithm review
Date: Sun, 17 Jun 2012 01:08:27 +0200
From: Andrej Podzimek <[email protected]>
To: Martin Decky <[email protected]>

Race condition?
---------------
Unless I am missing something, the algorithm appears
to suffer from a race condition between the
detector and reclaimer threads.

Detector's loop is:
     sleep
     ++cur_gp
     signal reclaimers a GP ended
     wait for preexisting readers

Instead of:
     sleep
     ++cur_gp
     wait for preexisting readers
     signal reclaimers a GP ended

As a result, in the following sequence, callback B()
is invoked before the preexisting reader R exits:
     (reclaimer)     (detector)      (reader)
     CPU 1           CPU 2           CPU 3
     next_cbs += A
     signal detector
     wait for detector
     (sleeps)
                     ++cur_gp
                     signal reclaimer
                     wait for readers
                                     R lock()
     (different thread)
     next_cbs += B
     (reclaimer wakes up)
     exec cur_cbs
     move next_cbs to cur_cbs
     signal detector
                     ++cur_gp
                     signal reclaimer

^^^ The second "signal reclaimer" call is a wrong assumption; it cannot happen here. The detector will *not* announce a GP boundary while CPU 3 is still inside a read-side critical section that had started *before* the reclaimer on CPU 1 signalled the detector. Let's see when the detector can actually announce the GP boundary:

A) CPU 3 would have to announce that it had already observed the *last* ++cur_gp (by assigning the new value to its local copy of cur_gp). But this has *not* happened yet -- there has been no R unlock(), and read-side sections in interrupt handlers obey the nesting counter (just like any nested R (un)lock() calls in general). So CPU 3 will have a stale copy of the GP counter and the detector will have to wait for it.

B) Of course, the interleaving of instructions is not fixed ... so let's assume, just for fun, that R lock() occurred *after* the last ++cur_gp, as observed by some other thread. Then R lock() must have set CPU 3's local copy of cur_gp to the last value and the detector will not wait for CPU 3. But again, this is not a problem: CPU 3 had then already observed the changes that resulted in issuing callback B *before* starting the read-side section (and called the conditional barrier in R lock()). It is therefore perfectly fine to execute callback B in parallel with CPU 3's read-side section.

C) The detector will simply wait until CPU 3 either announces the counter observation at R unlock(), context-switches, goes idle, or switches to userspace... If the detector waits for too long (more than a small N of lbolt ticks) or gets a high water mark announcement (from reclaimers), it will try to forcibly context-switch itself to CPU 3 ... which will *not* succeed before the non-preemptible read-side section is over.

     exec cur_cbs (including B!!)    R still locked!!
                     wait for readers

To sum up, maybe I am missing something, but I really cannot see a race condition here.
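
For illustration, here is a minimal user-space sketch of the wait described in A) through C). All names (cur_gp, last_seen_gp, NCPUS) are hypothetical stand-ins for the real per-CPU structures, and sched_yield() replaces the detector's forced cross-CPU context switch:

     #include <sched.h>
     #include <stdatomic.h>

     #define NCPUS 4

     static atomic_ulong cur_gp;               /* global GP counter */
     static atomic_ulong last_seen_gp[NCPUS];  /* per-CPU copies of cur_gp */

     static void wait_for_preexisting_readers(void)
     {
         unsigned long gp = atomic_load(&cur_gp);

         for (int i = 0; i < NCPUS; i++) {
             /* Case A: a CPU with a stale copy keeps us waiting until
              * R unlock() (or a context switch, idle, or userspace
              * transition) announces the counter.  Case B: a CPU whose
              * R lock() already observed ++cur_gp is skipped, which is
              * safe because it also observed the pending updates. */
             while (atomic_load(&last_seen_gp[i]) < gp)
                 sched_yield();  /* stand-in for case C's forced
                                  * cross-CPU context switch */
         }
     }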

If there were a race condition of this magnitude, the hash table benchmarks (combined with thorough tests using bit patterns) would probably crash within milliseconds. However, they have neither crashed nor detected any inconsistent data modifications so far. (Admittedly, they have only run for about two weeks on all the machines combined...)

Anyway, it is *great* that someone spends time and effort on exploring the algorithm and searching for possible shortcomings. I found a bug in the hash table implementation approximately six months ago. The bug only occurred under the most aggressive experimental optimization mode of the Intel compiler, and a single volatile cast fixed it. Yet one year earlier, I had been convinced that no inconsistency of that kind could ever be observed... (Notes about this interesting issue can still be found in the comments.) So again, code reviews are really important, and many thanks for this one.

+ low overhead reader_lock/unlock (1 MB per GP and CPU)

The overhead might be higher than that, unfortunately, although I am not sure about this. The MB should occur only when the global GP counter changes ... theoretically. But in practice, the CPU may issue the MB speculatively, just to make sure it can safely take the alternative branch at any time. It is hard to guess whether this actually happens.
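
To make the fast path concrete, here is a rough user-space sketch of a reader lock with the conditional barrier; _Thread_local stands in for per-CPU data and all names are hypothetical:

     #include <stdatomic.h>

     static atomic_ulong cur_gp;                /* global GP counter */
     static _Thread_local unsigned long my_gp;  /* this CPU's copy */
     static _Thread_local unsigned nesting;     /* read-side nesting */

     static void reader_lock(void)
     {
         nesting++;
         if (my_gp != atomic_load_explicit(&cur_gp, memory_order_relaxed)) {
             /* Slow path, taken at most once per GP and CPU: announce
              * the newly observed counter and issue the one MB. */
             my_gp = atomic_load_explicit(&cur_gp, memory_order_relaxed);
             atomic_thread_fence(memory_order_seq_cst);
         }
         /* Fast path: no barrier in the code -- though, as noted above,
          * the CPU might still prepare for the other branch speculatively. */
     }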

Linux uses prediction hints (the likely()/unlikely() macros wrapped around the conditions of if statements) that are reflected in the machine code on platforms that support them; e.g., SPARC has "probably taken" / "probably not taken" bits that can be set in each branch instruction. Perhaps this could help. In that case the fast path could be marked as "probably taken", in an attempt to prevent the CPU from issuing the MB speculatively.

But this is just a wild guess. Someone would have to define those special if statements (so far UTS only uses standard C if, with no macro replacements thereof) and *benchmark* it. GCC supports such hints directly via __builtin_expect (and GCC is now the default compiler for Illumos).
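
For example, with GCC the hints could look like this (Linux spells them likely() and unlikely(); nothing of the sort exists in UTS yet, so this is purely illustrative):

     #include <stdatomic.h>

     #define likely(x)   __builtin_expect(!!(x), 1)
     #define unlikely(x) __builtin_expect(!!(x), 0)

     static atomic_ulong cur_gp;
     static _Thread_local unsigned long my_gp;

     static void reader_lock_hinted(void)
     {
         if (unlikely(my_gp !=
                      atomic_load_explicit(&cur_gp, memory_order_relaxed))) {
             /* Marked "probably not taken": the compiler lays out the
              * fast path first and, on SPARC, could set the prediction
              * bits mentioned above. */
             my_gp = atomic_load_explicit(&cur_gp, memory_order_relaxed);
             atomic_thread_fence(memory_order_seq_cst);
         }
         /* Whether this stops the speculative MB would have to be
          * benchmarked, as stated above. */
     }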

- GP latency depends on cache coherency

There is a reasonable upper bound on GP duration as long as all read-side sections behave correctly (and they should -- we are in the f...ing kernel). So the mere fact that the average / expected / whatever latency is influenced by UMA/NUMA and/or cache coherency protocols is not that disastrous.

In some cases, longer GP durations save a lot of context-switching effort by batched reclamation processing.

- forces context switches of long running threads
   without RCU readers

True. But this does not occur very often (from the machine's perspective).

It would be nice if some threads could be explicitly marked as "free of RCU readers". The idle thread is (implicitly) marked that way right now and so is all userspace processing. :-)

Unfortunately, an attempt to set a "free of RCU readers" flag on each R unlock() would (as one can imagine) lead to ugly race conditions, so all ideas of this kind need to be considered very carefully.

Long-running kernel threads would (of course) be interrupted by the detector, but *most* long-running threads run in userspace (and so the detector does not need to take them into account and interrupt their CPUs at all). Long-running threads performing computations are *rare* in the kernel. Some of them are related to ZFS, but they use the SYSDC scheduling class, which means that they can be preempted by any other threads (including userspace threads) quite often.

- reclaimers and detector serialized via a single mutex

A mutex *and* a condition variable, which is slightly more scalable than a mutex alone. Furthermore, reclaimers process potentially long batches of callbacks with the mutex unlocked and the detector also keeps it unlocked most of the time. The detector prolongs grace period duration on purpose (when there is no high water announcement) by a small N of lbolt ticks, which reduces the frequency of competition for the mutex even further.
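
Roughly, the handshake looks like this (a user-space sketch with pthreads standing in for kmutex/kcondvar; all names are hypothetical):

     #include <errno.h>
     #include <pthread.h>
     #include <stdbool.h>
     #include <time.h>

     static pthread_mutex_t rcu_lock = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  rcu_cv   = PTHREAD_COND_INITIALIZER;
     static bool high_water;  /* set by reclaimers when queues grow long */

     /* Reclaimer side: announce new callbacks, possibly urgently. */
     static void reclaimer_announce(bool urgent)
     {
         pthread_mutex_lock(&rcu_lock);
         if (urgent)
             high_water = true;
         pthread_cond_signal(&rcu_cv);
         pthread_mutex_unlock(&rcu_lock);
     }

     /* Detector side: prolong the GP by a few ticks unless a high water
      * announcement arrives, then drop the mutex for the long wait. */
     static void detector_pause(const struct timespec *deadline)
     {
         pthread_mutex_lock(&rcu_lock);
         while (!high_water &&
                pthread_cond_timedwait(&rcu_cv, &rcu_lock, deadline) !=
                    ETIMEDOUT)
             ;
         high_water = false;
         pthread_mutex_unlock(&rcu_lock);
         /* ... wait for preexisting readers with the mutex unlocked ... */
     }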

However, it is definitely true that a better solution *must* be found for large NUMA systems. It would be nice to have one detector per NUMA node and a tree-based GP announcement protocol, skipping inactive nodes (of course!) and supporting multiple instances of the mechanism, thus keeping GP detection as local as possible and not competing with unrelated threads on distant NUMA nodes.

wait_on_gp - the GP number whose end we wait for
    before dispatching callbacks in cur_cbs (but it is
    not used that way; see the discussion of the race
    condition)

It's important to understand this basic idea correctly. A callback does *not* wait for the "end" of a particular GP. It waits for (at least) *two* GP boundaries to elapse. This is why reclaimers have two queues of callbacks -- one of them can be (and actually *is*) extended in parallel with the reclaimer's work, while the other one is waiting for one more GP boundary to occur and is not extended any longer. On each GP boundary, the queue with old callbacks is processed and swapped with the new queue atomically. Threads external to the RCU mechanism always wait for two GP boundaries to occur -- they are (in general) not expected to have the capability to split their work into smaller per-GP-boundary batches the way reclaimers do.
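
In code, the two-queue idea boils down to something like this (a sketch only; names hypothetical):

     #include <stddef.h>

     typedef struct rcu_cb {
         struct rcu_cb *next;
         void (*func)(void *);
         void *arg;
     } rcu_cb_t;

     static rcu_cb_t *next_cbs;  /* still extended by new callbacks */
     static rcu_cb_t *cur_cbs;   /* closed batch; waits one more boundary */

     /* Runs on each GP boundary, with the queues protected by the
      * reclaimer's mutex: execute the old batch, then swap the queues. */
     static void on_gp_boundary(void)
     {
         rcu_cb_t *cb = cur_cbs;

         while (cb != NULL) {
             rcu_cb_t *next = cb->next;
             cb->func(cb->arg);  /* enqueued two boundaries ago -- safe */
             cb = next;
         }
         cur_cbs = next_cbs;     /* the new batch waits one more boundary */
         next_cbs = NULL;
     }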

The algorithm relies on the CPU's cache-coherency
protocol to deliver (make visible) the updated cur_gp
to readers. The same applies in the other direction
when loading last_seen_gp in the detector.

It's just a heuristic and no more than that.

If the coherency protocol does not make the updated cur_gp visible to readers, they will not announce a GP observation and may prevent the detector from taking the fast path. If the coherency protocol does not make the updated copies of cur_gp visible to the detector, the detector will (again) be unable to take the fast path. However, this does not impact the correctness of the algorithm.
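
In other words, the heuristic only gates the fast path, which might look roughly like this (same hypothetical names as in the sketches above):

     #include <stdatomic.h>
     #include <stdbool.h>

     #define NCPUS 4

     static atomic_ulong cur_gp;
     static atomic_ulong last_seen_gp[NCPUS];

     /* Fast path check: if the coherency protocol has propagated cur_gp
      * and every CPU has announced it, nobody needs to be interrupted.
      * A stale copy merely returns false -- correctness is unaffected. */
     static bool all_cpus_announced(void)
     {
         unsigned long gp = atomic_load(&cur_gp);

         for (int i = 0; i < NCPUS; i++)
             if (atomic_load(&last_seen_gp[i]) < gp)
                 return false;
         return true;
     }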

Stressing the RCU mechanism extremely on a small NUMA system showed that the cache coherency protocol works fine and that the heuristic helps a *lot*. The number of context switches per GP caused by the detector thread seemed to have an exponential-like distribution with an expected value around 1 on a 24-CPU system. But admittedly, this was related to the frequency of context switches in our benchmarking workload (which is rather high on purpose). Furthermore, it would be nice to measure this *precisely*, rather than just guessing from context switch counters (and from a crippled detector that starts on the same CPU all the time instead of rotating over all the CPUs evenly).

If I am not mistaken, the implementation [6] checks
for threads being idle or in user space with
CPU->cpu_mstate in OpenSolaris. However, OpenSolaris
does not surround changes of cpu_mstate with MBs (for
performance reasons [7]); therefore, reading its value
from the detector is racy (e.g., if a CS immediately
follows the change).

Reading cpu_mstate might indeed seem prone to races, with one CS immediately following an idle / userspace state. However, this is exactly why GP boundaries are announced rather than GP "endings". The race condition you mention will never occur twice in a sequence, yet RCU callbacks (as well as external threads asking for GP detection) are always delayed until at least *two* GP boundaries have been observed.

^^^ If I am mistaken here, an example of a pathological scenario would be very helpful. The experience with the non-blocking hash table taught me one important fact: intuition is a good servant, but a poor master. :-) So none of my explanations can be taken for granted -- there can still be bugs in both my thoughts and the algorithm.
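
To make the two-boundary rule concrete, an external thread's wait might be sketched like this (user-space pthreads stand-in, hypothetical names):

     #include <pthread.h>

     static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  gp_cv   = PTHREAD_COND_INITIALIZER;
     static unsigned long gp_boundaries;  /* bumped on each GP boundary */

     /* External threads sleep through *two* boundaries; since the racy
      * cpu_mstate observation cannot occur twice in a sequence, at least
      * one of the two boundaries is detected soundly. */
     static void wait_for_grace_period(void)
     {
         pthread_mutex_lock(&gp_lock);
         unsigned long start = gp_boundaries;
         while (gp_boundaries < start + 2)
             pthread_cond_wait(&gp_cv, &gp_lock);
         pthread_mutex_unlock(&gp_lock);
     }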

_______________________________________________
HelenOS-devel mailing list
[email protected]
http://lists.modry.cz/cgi-bin/listinfo/helenos-devel
