Hi,
This email reviews existing RCU algorithms in preparation
for a short email discussing which one to implement.
I would kindly like to ask someone to check whether I
correctly understand how MBs work by reading the section
"Visibility vs MBs".
At the beginning of each algorithm's section there are
a few points highlighting the algo's strengths/weaknesses.
Please focus on this part and read the description or
pseudocode only if you are genuinely interested in how
the algos work.
And do not worry: I will *not* be sending another
humongous email any time soon.
Terminology, notation
---------------------
MB = full memory barrier
CS = RCU critical section between reader_lock/unlock();
the thread may hold references to RCU protected data
QS = quiescent state; the thread/reader is not in a CS
and is not accessing any RCU protected data (not even
when considering instruction reordering)
GP = grace period; (any) period of time such that each
thread passes through a QS, ie all CS existing at
the beginning of a GP will have completed before
the GP ends
GP latency = the time it takes to detect that all threads
have reached a QS measured from the moment they
all really do pass a QS (ie, latency should ideally
be zero).
reader = thread that executes a CS during a GP
writer = thread that changes RCU protected data, possibly
outside of an RCU CS.
atomic { stmts } = statements are run atomically, but
the block of statements can be reordered with other
non-atomic memory accesses.
atomic_w_MB { stmts } = atomic { MB; stmts; MB }
similar to Promela atomic blocks
wait until (cond) = waits (eg sleeps) until the condition
becomes true and another thread "signals" that the
condition changed
wrt = with respect to
RCU semantics recap
-------------------
I expect you are familiar with RCU's semantics, so here
is just a quick recap.
RCU coordinates concurrent execution of a reader with
other readers or writers. It does not, however, synchronize
writers with other writers in any way.
These operations are provided:
reader_lock(), reader_unlock(), synchronize(), call_rcu()
Readers always run concurrently with other readers and
even writers, ie reader_lock/unlock() never block.
A writer can coordinate with readers by means of call_rcu()
or synchronize(). synchronize() blocks for an entire
GP. A callback registered via call_rcu() will be invoked
after a full GP elapses. Therefore, they provide these
guarantees:
1) Waits for any readers that may be accessing old
versions of the data that the writer changed prior
to invoking call_rcu() or synchronize(). Once
sync() exits or call_rcu()'s callback is run, we
are guaranteed that all preexisting readers have
finished and no one is accessing an old version
of the data.
2) Ensures all new readers (those that sync() does
not wait for) see writer's changes made prior
to eg sync().
Notice:
- Readers running concurrently with synchronize() may
or may not see any changes; until sync() exits we
just don't know.
- If a reader loads a variable changed by a writer
twice in a single CS, it may get an old value
for one of the loads and the new value for the
other load.
- Writers must coordinate by other means, eg locking
or atomic operations.
- There are few guarantees as to when changes
from a *reader* CS are visible to all readers
unless the CS is followed by a sync()/call_rcu().
As far as modifications in a reader CS go, running
a sync() after the reader exits its CS but on a
*different* cpu guarantees next to nothing!
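To make guarantees 1) and 2) a bit more concrete, here is
a minimal C sketch of the usual update pattern. Only
reader_lock/unlock() and synchronize() come from the API
above; the struct, the global pointer and the prototypes
are illustrative assumptions of mine, not code from any of
the reviewed implementations (which also hide the required
MBs behind pointer publishing/dereferencing primitives).

    #include <stdlib.h>

    /* Provided by whichever RCU implementation is in use. */
    void reader_lock(void);
    void reader_unlock(void);
    void synchronize(void);

    struct item {
        int value;
    };

    static struct item *shared;   /* RCU protected; assume it
                                     was initialized earlier  */

    /* Reader: never blocks and may run concurrently with the
     * writer below; it sees either the old or the new item,
     * but always a complete one. */
    int read_value(void)
    {
        reader_lock();
        struct item *p = shared;
        int v = p->value;
        reader_unlock();
        return v;
    }

    /* Writer: concurrent writers must be kept out by other
     * means, eg a mutex (see the Notice above). */
    void write_value(int value)
    {
        struct item *new_item = malloc(sizeof(*new_item));
        struct item *old_item;

        new_item->value = value;
        old_item = shared;
        shared = new_item;        /* publish the new version */

        /* Guarantee 1): once synchronize() returns, no
         * preexisting reader can still hold a reference to
         * old_item, so it is safe to free. Guarantee 2):
         * readers starting after this point see the new item. */
        synchronize();
        free(old_item);
    }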
Visibility vs MBs
-----------------
I base the following discussion on [18, 19, 20].
In order for a store on one cpu (CPU_1) to become
visible to a load on another cpu (CPU_2):
1) CPU_1 must execute a MB after the store (wrt
instruction order).
2) After CPU_1's MB completes (wrt cache-coherency
bus traffic) CPU_2 must execute a MB. Then (wrt
instruction order) CPU_2 can issue a load which
is guaranteed to see the stored value.
If either of the MBs is omitted, CPU_2 may never
load the stored value (even if it loads it in a loop).
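As an illustration, here is a minimal C sketch of the rule
above in the style of [18] (__sync_synchronize() stands in
for a full MB; the flag/data variables are made up):

    volatile int data = 0;      /* assume both variables live */
    volatile int flag = 0;      /* in different cache lines   */

    /* CPU_1 */
    void cpu1_store(void)
    {
        data = 42;
        __sync_synchronize();   /* MB after the store, point 1) */
        flag = 1;
    }

    /* CPU_2 */
    int cpu2_load(void)
    {
        /* Observing flag == 1 tells us CPU_1's MB completed,
         * because flag was stored only after that MB. */
        while (flag == 0)
            ;
        __sync_synchronize();   /* MB before the load, point 2) */
        return data;            /* guaranteed to return 42 only
                                   if *both* MBs are present */
    }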
In practice, CPU_2 will eventually see the new value.
Because store buffers and invalidate queues are limited
in size, some cpu in the system would eventually have to
stall if stores could be hidden indefinitely [18].
What I would like to know is how long it takes for
CPU_2 to first see CPU_1's store if CPU_2 omits its
MB, but CPU_1 does not. CPU_2's cache may be busy, so
its invalidate queue may not get to be processed [18].
However, CPU_2's performance is not affected. It
can continue working with its cache without worrying
about the inv. queue (that's why the queue is not being
processed by the cache in the first place - the
cpu is busy using its cache). Therefore, CPU_2 has no
motivation to process the queue. Even if the queue
becomes full, at worst some other cpu will stall but
not CPU_2.
Does this effect occur in practice? If so, for how
long can CPU_2 go on without seeing the store? Is
it on the order of milliseconds or can it theoretically
last forever (disregarding context switches)?
Some RCU algorithms depend on this effect to a greater
extent (eg podzimek-rcu). If omitting a MB in CPU_2 is
a real problem, it is necessary to account for threads
running with large time quanta. Otherwise, RCU processing
could be significantly delayed, which could in turn lead
to memory exhaustion.
According to [18] ia64 and Power cpus reorder memory
accesses aggressively; quite unlike x86.
General URCU (gen-urcu) [1]
============
? userspace rcu
+ readers may be preempted
+ simple
+ low latency GP detection
- readers always issue MBs
- GP twice as long as it needs to be in the common case
- threads polled for QSs serially
Data
----
per-thread:
nesting_count - number of nested reader_lock()s; if 0,
the thread is not in a CS.
reader_group - 0 or 1; identifies this thread with either
preexisting or new readers once a GP detection begins
global: (modifications w/ mutex)
cur_reader_group - 0 or 1; new readers are assigned this
group number.
Algorithm (per-thread 2 counters w/ flip)
---------
Readers increment their nesting_count. The outermost
reader also locally stores the cur_reader_group and then
issues a MB to make the nesting_count change visible
to the GP detecting code. At the same time, the MB forces
the cpu to keep the memory accesses of the CS contained
within reader_lock/unlock().
In order to detect a full GP we flip cur_reader_group.
New readers will therefore associate their nesting_count
with this new group. Preexisting readers are associated
with the previous/old group. Since no new readers can
be associated with the old group number the number of
readers in the old group monotonically decreases and
eventually reaches zero as these preexisting readers
exit their CSs. We check each thread to see whether it
is a preexisting reader (ie it has the old reader_group:
t.reader_group != cur_reader_group && t.nesting_count > 0)
and poll it until its nesting_count drops to zero.
Note 1
------
nesting_count and reader_group are packed in a single
word so that a thread can be atomically assigned
a reader_group and have its nesting_count incremented
for the first time. reader_group typically occupies
the most-significant bit of the word with nesting_count.
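As an illustration, the packing could look like this (the
exact layout is my assumption; [1] only requires that both
fields fit in a single word that can be read and written
atomically):

    #include <limits.h>

    /* reader_group lives in the most-significant bit, the
     * nesting count in the remaining bits of the word. */
    #define READER_GROUP_MASK \
        (1UL << (sizeof(unsigned long) * CHAR_BIT - 1))
    #define NESTING_MASK  (~READER_GROUP_MASK)

    static inline unsigned long pack(unsigned long group,
                                     unsigned long nesting)
    {
        return (group ? READER_GROUP_MASK : 0)
            | (nesting & NESTING_MASK);
    }

    static inline unsigned long group_of(unsigned long word)
    {
        return (word & READER_GROUP_MASK) ? 1 : 0;
    }

    static inline unsigned long nesting_of(unsigned long word)
    {
        return word & NESTING_MASK;
    }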
Note 2
------
A thread may be preempted before it stores the loaded
cur_reader_group in rcu_reader_lock(). While the thread
is preempted we may flip cur_reader_group and wait
for all preexisting readers to exit. If the thread is
resumed after that, this *new* reader will be erroneously
associated with the preexisting reader group even though
that synchronize() is no longer in progress. Should another
synchronize() follow immediately, it would not wait
for this reader to complete. To wait for such readers
as well, synchronize() waits in turn for new as well as
old readers. That prolongs the GP.
The same reasoning applies even if reader_lock() is not
preempted but merely executes between the point where
synchronize() issues its first MB and the point where it
flips cur_reader_group.
Note 3
------
There is no MB after "--nesting_count" in reader_unlock();
therefore, it may take some time before the
decrement is visible to synchronize(). In the worst
case, a context switch makes it visible.
Pseudocode
----------
reader_lock {
    if 0 == nesting_count
        atomic {
            nesting_count = 1
            reader_group = cur_reader_group
        }
        // Make nesting_count visible. Prevent
        // CS memory accesses from leaking and
        // ensure prior changes are visible in CS.
        MB
    else
        ++nesting_count
}
reader_unlock {
    // Separate CS from "--nesting_count".
    MB
    --nesting_count
}
synchronize {
    with locked mutex {
        // Make changes prior to sync() visible
        // in reader CSs.
        MB
        do twice {
            // Separates old and new readers.
            cur_reader_group = !cur_reader_group
            for all threads
                while thread.gp_ongoing
                    // implicit MB
                    sleep(10 ms)
        }
        // Separate from following changes.
        MB
    }
}
t.gp_ongoing {
    // in CS
    return 0 < t.nesting_count
        // a preexisting reader
        && cur_reader_group != t.reader_group
}
Signal URCU (sig-urcu) [1]
===========
? userspace rcu
+ readers may be preempted
+ reader_lock() just increments a counter
+ simple
+ low latency GP detection
- GP detection disrupts readers
- GP twice as long as it needs to be in the common case
- threads polled for QSs serially
Data
----
Exactly the same as in (gen-urcu).
Algorithm (per-thread 2 counters w/ flip, MB on demand)
---------
The algorithm is the same as in (gen-urcu). (sig-urcu)
differs only in that it does away with MBs in reader
locking. Because reader_lock()s do *not* issue a
MB after ++nesting_count the cpu may reorder this
increment with instructions in the CS. Therefore,
synchronize() cannot be sure that a thread is not
executing its CS even if nesting_count == 0. Because
this runs in user space, synchronize() cannot detect
naturally occurring MBs (such as those in a context switch).
Instead, synchronize() requests a MB in the reader
on demand. It sends all threads a signal and then
waits for signal handlers to issue a MB.
Issuing a MB has the effect of promoting compiler
optimization barriers into full MBs. To see this,
imagine the MB is executed by the reader after the
compiler barrier. Other cpus can then be sure all accesses
prior to the compiler barrier have completed. Similarly,
if the MB is executed before the compiler barrier,
none of the accesses after the compiler barrier have
begun yet. The net effect is the same as having a MB
in place of the compiler barrier.
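To illustrate (GCC-style definitions assumed; this is not
code taken from [1]): the reader itself contains only a
compiler barrier, and the hardware MB is injected
asynchronously by the signal handler that synchronize()
requests.

    /* Compiler-only barrier: forbids compile-time reordering
     * across it but emits no hardware fence. */
    #define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

    /* Runs in the reader thread when synchronize() asks for
     * a MB; the MBs order everything the thread executed
     * before the signal against everything after it. */
    static void mb_signal_handler(int sig)
    {
        (void) sig;
        __sync_synchronize();
        /* acknowledge to synchronize() here, eg by atomically
         * decrementing a shared counter (omitted) */
        __sync_synchronize();
    }

Because the compiler barrier already fixed the order of the
reader's accesses at compile time, the signal-delivered MB
lands either before or after the barrier point, which is
exactly the two cases analyzed above.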
Pseudocode
----------
reader_lock {
    if 0 == nesting_count
        atomic {
            nesting_count = 1
            reader_group = cur_reader_group
        }
        compiler_barrier()
    else
        ++nesting_count
}
reader_unlock {
    compiler_barrier()
    --nesting_count
}
synchronize {
    with locked mutex {
        // Ensure changes prior to synchronize() are
        // visible to other cpus.
        MB
        // Make the most current nesting_count visible
        // to this cpu and ensure new readers see/load
        // changes prior to sync().
        force_mb_on_all_threads()
        do twice {
            // Separates old and new readers.
            cur_reader_group = !cur_reader_group
            for all threads
                while thread.gp_ongoing
                    // Implicit MB
                    sleep(10 ms)
        }
        // Old readers exited CS via reader_unlock().
        // These MBs contain any late mem accesses
        // of CS before proceeding further.
        force_mb_on_all_threads()
    }
}
force_mb_on_all_threads {
    send a signal to all threads
    wait for an acknowledgement
}
signal handler {
    MB
    acknowledge
    MB
}
t.gp_ongoing {
    // in CS
    return 0 < t.nesting_count
        // a preexisting reader
        && cur_reader_group != t.reader_group
}
Classic RCU (ltimer-rcu, classic-rcu) [2,3]
===========
? non-preemptible reader sections
+ fast reader_lock/unlock (just disables preemption)
+ very low overhead GP detection (it only observes, never
forces, context switches, user mode, the idle loop)
+ all cpus detect GP in parallel
- requires regular sampling timer/scheduling tick
- GP detection is held back if threads are allotted
larger time quanta that they spend in the kernel
Data
----
per-cpu: (no locks, only local cpu may modify)
cur_cbs, next_cbs - callback queues
wait_on_gp - gp num whose end we are waiting for to
dispatch callbacks (in cur_cbs)
global: (modifications w/ spinlock)
cur_gp - current GP we are waiting for to end
wait_on_cpus - set of cpus we are waiting for to pass
through a QS for the current/ongoing GP to end
max_wait_on_gp - maximum of all cpus' wait_on_gp
Algorithm (wait set + independently in parallel)
---------
Readers only disable preemption.
A QS is a context switch or the regular scheduler tick
finding the current thread in user mode or the idle loop.
GP cur_gp is in progress if wait_on_cpus != none, ie
if the set wait_on_cpus is not empty. A timer interrupt
detects whether a cpu passed through a QS, ie whether it
interrupted a task in user mode or the idle loop, or a
context switch occurred since the last timer interrupt.
As cpus pass
through QSs they are removed from wait_on_cpus. The
last one removed announces the end of the current grace
period by incrementing the current grace period number.
Other cpus eventually notice this change again in the
timer interrupt and dispatch waiting callbacks. To
start the detection of GP end again, one simply adds
all cpus into wait_on_cpus.
The algorithm tolerates the cpu leaking/reordering memory
accesses out of the code delimited by reader_lock/unlock()
because the detected QSs inherently include a MB
(a ctx switch, a switch to the idle loop; presumably a
switch to or from user mode also acts as a MB).
Reading/checking the current GP number (cur_gp) in
the timer interrupt is *not* preceded with a MB, so
it may read an old value (and delay the dispatch of
the cpu's callbacks). However, in the worst case, the
next QS (since it has a MB) will make the current value
of the global GP number visible to the reading cpu.
Note
----
There is a variation of this algorithm (Tree RCU or
Hierarchical RCU [3]) that reduces contention of the
spinlock protecting global data and the wait_on_cpus
set. It simply partitions the set and the spinlock
among groups of cpus.
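A rough sketch of that partitioning (types, field names,
lock_t, lock()/unlock() and announce_gp_done() are my own
placeholders, not taken from [3]):

    /* Each group of cpus reports QSs to its own leaf node
     * under that node's lock; only the last cpu of a group
     * touches the parent, so the root lock is taken rarely. */
    struct rcu_node {
        lock_t           lock;          /* protects wait_mask   */
        unsigned long    wait_mask;     /* cpus/children yet to
                                           report a QS          */
        unsigned long    bit_in_parent; /* this node's bit in
                                           the parent's mask    */
        struct rcu_node *parent;        /* NULL at the root     */
    };

    void report_qs(struct rcu_node *node, unsigned long bit)
    {
        lock(&node->lock);
        node->wait_mask &= ~bit;
        int last = (node->wait_mask == 0);
        unlock(&node->lock);

        if (!last)
            return;
        if (node->parent)
            /* Last reporter in this group; propagate upward. */
            report_qs(node->parent, node->bit_in_parent);
        else
            /* Root emptied: all cpus passed a QS, GP ended. */
            announce_gp_done();
    }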
Pseudocode
----------
reader_lock {
    disable preemption
}
reader_unlock {
    enable preemption
}
timer interrupt handler {
    // check w/o MB or lock whether GP wait_on_gp ended
    if wait_on_gp < cur_gp
        exec cur_cbs
    // new batch of callbacks is ready
    if cur_cbs.empty && !next_cbs.empty
        move next to cur
        with locked spinlock {
            // callbacks batched during cur_gp
            // wait for an entire grace period
            wait_on_gp = cur_gp + 1
            // record the request for that GP's end
            max_wait_on_gp = max(max_wait_on_gp, wait_on_gp)
            // gp not in progress
            if wait_on_cpus == none
                // start new gp
                wait_on_cpus = all
        }
    if passed a QS (ie in user mode, idle loop, ctx switch)
        with locked spinlock {
            remove me from wait_on_cpus
            // Last cpu to pass through a QS.
            if wait_on_cpus == none
                // Announce that this GP ended.
                ++cur_gp
                // Start new GP if someone is waiting for it
                if cur_gp <= max_wait_on_gp
                    wait_on_cpus = all
        }
}
Houston's RCU (houston-rcu) [4]
=============
? non-preemptible reader sections
+ no sampling in scheduler tick
+ low latency GP end detection
- CAS/MB in reader_unlock()
- still requires a (bit less frequent) timer interrupt
- (some) disruption of readers during GP detection
Data (similar to classic-rcu)
----
per-cpu: (no locks, only local cpu may modify)
flags - NOT_IN_CS, IN_CS, IN_CS|ANNOUNCE_QS determine
if this cpu is in a CS or not, and if writers are
waiting for a QS announcement when exiting the CS.
cur_cbs, next_cbs - callback queues
wait_on_gp - gp num whose end we are waiting for to
dispatch callbacks (in cur_cbs)
next_wait_on_gp - wait_on_gp for next_cbs
global: (modifications w/ spinlock)
cur_gp - current GP we are waiting for to end
wait_on_cpus - set of cpus we are waiting for to pass
through a QS for the current/ongoing GP to end
Algorithm
---------
Houston's patch removes (classic-rcu)'s reliance on
regular scheduler ticks. In (classic-rcu) the scheduler
tick was used to batch callbacks, detect local QSs, and
dispatch callbacks whose GPs had ended. (houston-rcu)
moves these responsibilities to call_rcu() and
reader_unlock(). It batches callbacks in call_rcu() until,
say, 200 have been enqueued. reader_unlock() announces QSs
if a GP is in progress. Both call_rcu() and reader_unlock()
detect that a GP elapsed but leave it up to a
designated thread to invoke callbacks.
A new GP starts when enough callbacks have been queued
locally (eg 200) or if callbacks have waited for too
long (eg 100 ms). At the start of the GP all cpus are
polled to see whether they are currently in a reader CS.
Those that are in a CS are asked to announce a QS once they exit
the CS. Therefore, the algorithm does not rely on observing
naturally occurring QSs such as context switches (as
in classic-rcu).
Note 1
------
Although the algorithm no longer requires the presence
of a regular scheduler tick on all cpus it still uses a
more-or-less regular timer to recognize callbacks waiting
too long for a new GP to start.
Bug?
----
The patch for (houston-rcu) omits a MB in rcu_read_lock().
Unfortunately, that would allow memory references from
the CS to leak above rcu_read_lock(), so new readers could
access stale data. I added the MB in the pseudocode
below.
Pseudocode (just an outline - some details not included)
----------
reader_lock {
    disable preemption
    if ++nesting == 1
        atomic { flags = IN_CS }
        MB
}
reader_unlock {
    if --nesting == 0
        atomic_w_MB {
            prev_flags = flags
            flags = NOT_IN_CS
        }
        // writers are waiting for a QS
        if ANNOUNCE_QS & prev_flags
            announce_qs()
    enable preemption
}
announce_qs {
    remove this cpu from wait_on_cpus
    if wait_on_cpus == none
        announce_gp_done()
}
call_rcu {
    add callback to next_cbs
    if added as first callback
        next_wait_on_gp = cur_gp + 1
    if 200 < next_cbs.count
        start_gp()
}
start_gp {
    // [snip] some important details scrapped
    wait_on_cpus = all
    ++cur_gp
    // poll all cpus if in a CS
    for all cpus
        // instruct unlock() to announce a QS if in CS
        prev = CAS(cpu.flags, IN_CS, IN_CS | ANNOUNCE_QS)
        // not executing a CS => already in a QS
        if prev == NOT_IN_CS
            rm cpu from wait_on_cpus
    if wait_on_cpus == none
        announce_gp_done()
}
Podzimek's RCU (podzimek-rcu) [5,6]
==============
? non-preemptible reader sections
+ low overhead reader_lock/unlock (1 MB per GP and cpu)
+ no sampling scheduler ticks required
+ no scheduler instrumentations necessary
- GP latency depends on cache coherency
- forces context switches of long running threads
without RCU readers
- reclaimers and detector serialized via a single mutex
Data
----
per-cpu: (no locks, only local cpu may modify)
last_seen_gp - the GP number for which this cpu last
noted an explicit QS (a MB)
wait_on_gp - GP num whose end we are waiting for to
dispatch callbacks in cur_cbs (but it is not used
that way; see discussion of race condition)
cur_cbs, next_cbs - callback queues
nesting_count - level of RCU CS nesting; 0 if not in CS
global: (modifications w/ mutex)
cur_gp - current grace period we are waiting for to end
req_gp_cnt - number of GPs to wait for after the current
GP ends (tells the detector when to stop)
Algorithm (request QS/MB with ++gp)
---------
Podzimek does not depend on regular timer ticks. He
centralizes GP detection into a single detector thread.
Instead of sampling reader state each timer tick, the
detector explicitly requests that the first reader to
notice the start of a new GP on each cpu announce the
first encountered QS. Once all cpus announce a QS, the
GP ends.
The detector increments cur_gp in order to start a new
GP. Readers notice the new GP by comparing the changed
cur_gp to a locally stored value last_seen_gp. Readers
check for the change in unnested reader_lock/unlock as
these functions represent a QS that must be announced
for the new GP. The reader first executes a MB in order
to contain memory references within a CS (and to make
changes made by writers visible in the CS following
reader_lock()). Next, it notes that it reached a QS
by updating last_seen_gp to cur_gp.
The detector waits a while after starting a GP and
then polls each cpu's last_seen_gp and context switch
counter to see if it reached a QS. If a cpu did not
record a QS (it might be running a long thread without
an RCU reader CS), a forced context switch is scheduled
on that cpu. Once all cpus pass through a QS, callbacks
are executed in parallel on each cpu in separate
reclaimer threads.
Race condition?
---------------
Unless I am missing something the algorithm appears
to be suffering from a race condition between the
detector and reclaimer threads.
The detector's loop is:
sleep
++cur_gp
signal reclaimers a GP ended
wait for preexisting readers
Instead of:
sleep
++cur_gp
wait for preexisting readers
signal reclaimers a GP ended
As a result, in the following sequence, callback B()
is invoked before the preexisting reader R exits:
(reclaimer on CPU 1, detector on CPU 2, reader on CPU 3)
1. CPU 1 (reclaimer): next_cbs += A; signal detector;
   wait for detector (sleeps)
2. CPU 2 (detector): ++cur_gp; signal reclaimer;
   wait for readers
3. CPU 3 (reader): R enters a CS via lock()
4. CPU 1 (a different thread): next_cbs += B
5. CPU 1 (reclaimer wakes up): exec cur_cbs;
   move next_cbs to cur_cbs; signal detector
6. CPU 2 (detector): ++cur_gp; signal reclaimer
7. CPU 1 (reclaimer): exec cur_cbs (including B!!)
   while R on CPU 3 is still in its CS!!
8. CPU 2 (detector): wait for readers
Note 1
------
The algorithm relies on the cpu's cache-coherency protocol
to deliver (make visible) the updated cur_gp to readers.
The same applies in the other direction when loading
last_seen_gp in the detector.
The cpu is not required to make the changes visible
to other cpus until it executes a MB. Therefore, in the
worst case, the detector will have to wait for the next
context switch. If a switch does not occur, the detector
will needlessly force one.
Bug
---
If I am not mistaken, the implementation [6] checks
for threads being idle or in user space with
CPU->cpu_mstate in OpenSolaris. However, OpenSolaris
does not surround changes of cpu_mstate with MBs (for
performance reasons [7]); therefore, reading its value
from the detector is racy (eg if a CS immediately
follows the change).
Pseudocode
----------
reader_lock {
    disable preemption
    check_qs
    ++nesting_count
}
reader_unlock {
    --nesting_count
    check_qs
    enable preemption
}
check_qs {
    // Not a QS while still nested in a CS.
    if 0 < nesting_count
        return
    // A QS has been requested as a new GP started.
    if last_seen_gp != cur_gp
        gp = cur_gp
        // Contain memory references in a CS.
        // Makes writer changes visible if
        // entering a CS
        MB
        // Ack that a MB was executed in a QS.
        last_seen_gp = gp
}
call_rcu(callback) {
    with disabled preemption {
        add callback to next_cbs
        if added as first callback
            wake up local reclaimer
        check_qs
    }
}
// Each cpu has a bound reclaimer thread.
reclaimer threads {
    loop forever {
        // Wait for work to arrive.
        if cur_cbs.empty && next_cbs.empty
            wait until !next_cbs.empty
        // Makes changes prior to enqueuing each
        // of cur_cbs visible to other cpus.
        MB
        with locked mutex {
            wait_on_gp = cur_gp + 1
            if 0 == req_gp_cnt
                req_gp_cnt = 1
                signal detector
            wait until (wait_on_gp <= cur_gp)
        }
        exec cur_cbs
        move next_cbs to cur_cbs
    }
}
// A single GP detection thread for the whole system.
detector thread {
    loop forever {
        with mutex locked {
            wait until (0 < req_gp_cnt)
            --req_gp_cnt
            // Start a new GP. Request a MB in a QS
            // from readers.
            ++cur_gp
            signal reclaimers
        }
        wait_for_readers
    }
}
wait_for_readers {
    reader_cpus = none
    // Gather potential readers in reader_cpus.
    for each cpu {
        may_be_reading = cpu is in kernel and not idle
            && cpu.last_seen_gp != cur_gp
        if may_be_reading
            cpu.last_ctx_switch_cnt = cpu.ctx_switch_cnt
            reader_cpus += cpu
    }
    // Batch other callbacks; give readers a chance
    // to finish.
    sleep(10ms)
    for each cpu in reader_cpus {
        may_be_reading = cpu is in kernel and not idle
            && cpu.last_ctx_switch_cnt == cpu.ctx_switch_cnt
            // Still has not executed an explicit MB.
            && cpu.last_seen_gp != cur_gp
        if may_be_reading
            force and wait for a ctx switch on cpu
    }
}
Sleepable RCU (srcu) [8]
=============
+ readers may block/sleep and be preempted
+ low overhead reader_lock/unlock
+ allows waiting for a specific group of srcu readers
- large GP detection latency (>3x that of classic-rcu)
- depends on a preemption based RCU
- GP detection not amortized between domains/groups
of readers
Data
----
per-thread: (no locks, only local thread reads/modifies)
reader_group - 0 or 1; identifies this thread with either
preexisting or new readers once a GP detection begins
per-cpu: (no locks, only local cpu may modify)
reader_cnt[2] - number of active readers in each group
global: (modifications w/ mutex)
cur_reader_group - new readers are assigned to this group
Algorithm (per-cpu 2 counters w/ flip)
---------
Similarly to (sig-urcu), (srcu) tracks the number of
preexisting and new readers. However (srcu) runs in the
kernel so it can track readers per cpu instead of tracking
each thread separately. It does so with a pair of counters
per cpu - one for preexisting readers, the other for new
readers.
Current GP ends when the sum of the number of preexisting
readers on individual cpus drops to zero. Notice however
that due to thread migration a reader may increment
the reader_cnt on one cpu and decrement it on another cpu.
The sum of reader_cnt over all cpus nevertheless reflects
the number of readers in a group.
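A small worked example of the migration case: a reader
enters its CS on cpu 0 and is migrated to cpu 1 before
exiting.
    cpu 0: reader_cnt[g]: 0 -> +1    (reader_lock on cpu 0)
    cpu 1: reader_cnt[g]: 0 -> -1    (reader_unlock on cpu 1)
A per-cpu count may thus even become negative, but the sum
(+1) + (-1) = 0 still correctly reports that no reader of
group g remains active.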
Unlike (sig-urcu), (srcu) does not suffer from having
to run two GP detections in a row as it can disable
preemption inside the rcu_read_lock/unlock functions and
utilize a preemption based RCU implementation already
present in the kernel. It also relies on this RCU to
ensure that readers execute MBs when waiting for a GP
to end. Eg if the underlying RCU implementation is
(classic-rcu), (srcu) will not force MBs in readers
like (sig-urcu) and will wait for them to naturally
occur (eg when switching contexts).
Although (srcu) does not wait for two consecutive GPs
to end like (sig-urcu), its GP detection latency can
be quite large since synchronize() has to wait for
three GPs of the underlying preemption-based RCU to
elapse.
Pseudocode
----------
reader_lock {
    with disabled preemption {
        reader_group = cur_reader_group
        this cpu's reader_cnt[reader_group]++
    }
}
reader_unlock {
    with disabled preemption {
        this cpu's reader_cnt[reader_group]--
    }
}
synchronize {
    with mutex locked {
        // Ensure changes made prior to invoking
        // synchronize() will be visible to new
        // readers, ie if a reader sees the flipped
        // cur_reader_group it is guaranteed to
        // see prior changes.
        preempt_synchronize()
        old_reader_group = cur_reader_group
        cur_reader_group = !cur_reader_group
        // Make sure new readers see and use
        // the new value of cur_reader_group
        preempt_synchronize()
        while readers_active(old_reader_group)
            sleep(10 ms)
        // Ensure reader_unlock()s have completed
        // along with any mem accesses in the CS
        preempt_synchronize()
    }
}
readers_active(group) {
    for all cpus
        sum += cpu's reader_cnt[group]
    return 0 < sum
}
Quick sleepable RCU (qrcu) [9,10]
===================
+ readers may block/sleep and be preempted
+ very small GP detection latency
+ simple
+ allows waiting for a specific group of qrcu readers
- high overhead reader_lock/unlock (atomic ops
on globally shared variables)
Data
----
per-thread: (no locks, only local thread reads/modifies)
reader_group - 0 or 1; identifies this thread with either
preexisting or new readers once a GP detection begins
global: (modifications only with lock)
reader_cnt[2] - number of active readers in each group
cur_reader_group - new readers are assigned to this group
Algorithm (global 2 counters w/ flip)
---------
QRCU is optimized to very quickly detect the end of
GPs. Once the last preexisting reader exits its CS,
it immediately wakes synchronize() up.
The algorithm uses the same idea as (gen-urcu), namely
having two counters - one for preexisting readers and
the other for newly arriving readers. (qrcu) differs
in that it uses a single system-wide pair of counters
that readers adjust atomically. Therefore, one can
always quickly and accurately determine the number of
active readers (without having to sum per-cpu counters
or check all threads).
Note
----
Unlike in (gen-urcu), (sig-urcu) or (srcu), the number
of preexisting readers may increase if synchronize()
flips cur_reader_group right after reader_lock() loads
its old value but before reader_lock() gets a chance
to increment reader_cnt[preexisting readers group].
Such a race is rare and it is tolerated as long as
readers never increment a counter that has already
dropped to zero (a zero count is what signals the end
of a GP). (qrcu) artificially keeps the new readers
group's counter greater by 1 than its true value to
indicate that readers may safely use it, ie as long as
the count is positive it is safe to use. The preexisting
readers' count reflects its true value. As a result, if
that count drops to zero it stays zero.
Pseudocode
----------
reader_lock {
    while (true) {
        reader_group = cur_reader_group
        atomic_w_MB {
            // Never increase a counter that indicates
            // a GP ended if loaded old cur_reader_group
            // in a race with synchronize()
            if 0 < reader_cnt[reader_group]
                ++reader_cnt[reader_group]
                return
        }
    }
}
reader_unlock {
    if 0 == atomic_w_MB { --reader_cnt[reader_group] }
        wake up synchronize()
}
synchronize {
    with lock {
        old_group = cur_reader_group
        new_group = !cur_reader_group
        // Allow new readers to use new_group
        atomic_w_MB { ++reader_cnt[new_group] }
        cur_reader_group = new_group
        // Ensure that once reader_cnt[old_group]
        // drops to 0 it stays 0.
        atomic_w_MB { --reader_cnt[old_group] }
        sleep while 0 < reader_cnt[old_group]
    }
}
New sleepable RCU (srcu-2012) [11,12]
=================
+ readers may block/sleep and be preempted
+ small GP detection latency
+ may wait for a specific domain/group of readers
- readers always issue MBs
- GP detection not amortized between domains/groups
of readers
Data
----
per-thread: (no locks, private to thread)
reader_group - 0 or 1; identifies this thread with
either preexisting or new readers during GP detection
per-cpu: (no locks, only local cpu may modify)
reader_cnt[2] - number of active readers in each group
cs_seq[2] - sequence number of reader_lock()s (but not
unlocks()) completed so far in each reader group
global: (spinlock protects modifications)
cur_reader_group - 0 or 1; new readers are assigned
to this group
Algorithm (per-cpu 2 counters + sequence w/ flip)
---------
(srcu-2012) combines (srcu) and (gen-urcu). Like in (srcu),
per-cpu counters are used to track the number of new
and preexisting readers. Exactly like in (gen-urcu),
modifications of these counters are separated from CSs
with MBs. However, unlike in (gen-urcu), two group flips
are not required for a GP to elapse.
Two group flips were necessary in (gen-urcu) because
cur_reader_group may be flipped between the point where
reader_lock() loads the value of cur_reader_group and the
point where it increments its nesting_count. Such a reader
would incorrectly be associated with the old/previous
readers group even though it is a new reader (it sees all
changes made prior to synchronize()) and need not be waited
for. The next synchronize() would not account for this
reader as it assumes there are no readers associated with
the previous readers group.
Instead of flipping cur_reader_group twice in a row,
(srcu-2012) maintains the invariant that synchronize()
finds the previous reader group empty upon entry by
explicitly waiting for such readers before flipping
cur_reader_group. Therefore, only a single group flip
is really needed. Moreover, the case described in the
paragraph above will be rare and one can expect that
there will be no delayed readers in the previous group
when entering synchronize(). As a result (srcu-2012)'s
GP length is effectively cut in half.
Pseudocode
----------
reader_lock {
    reader_group = cur_reader_group
    ++reader_cnt[reader_group]
    // Separate the CS from the above increment.
    MB_L
    ++cs_seq[reader_group]
}
reader_unlock {
    // Separate the CS from the decrement.
    MB_U
    --reader_cnt[reader_group]
}
synchronize {
    // Make changes prior to sync() visible to new
    // readers.
    MB
    old_reader_group = !cur_reader_group
    // Wait for the delayed readers that were
    // accidentally associated with the previous
    // reader group. Should be quick.
    wait_for_readers(old_reader_group)
    // Do not mix group flip with wait_for_readers
    MB
    cur_reader_group = !cur_reader_group
    old_reader_group = !cur_reader_group
    // Do not mix group flip with wait_for_readers
    MB
    wait_for_readers(old_reader_group)
}
// May not wait for readers of old_reader_group that
// start concurrently with wait_for_readers but those
// readers will definitely see all changes made prior
// to sync().
wait_for_readers(old_reader_group) {
    forever {
        prev_seq_sum = sum all cs_seq[old_reader_group]
        // Pairs with MB_L. If MB_L comes before MB_A,
        // the update of reader_cnt will be incorporated
        // into the sum below. If MB_L comes after MB_A,
        // approx_active_cnt may not reflect the
        // update of reader_cnt in lock() and be equal
        // to 0. We won't wait for that reader, but the
        // reader's CS starts with MB_L so the reader
        // will see all changes made prior to MB_A
        // (ie also changes prior to sync()).
        MB_A
        approx_active_cnt =
            sum all reader_cnt[old_reader_group]
        if 0 == approx_active_cnt {
            // Pairs with MB_U. While summing reader_cnt
            // we may have missed a reader_cnt inc in lock()
            // but we could have seen a dec of reader_cnt
            // in an immediately following unlock(). MB_B
            // makes the change of cs_seq in the CS visible
            // to us here.
            MB_B
            cur_seq_sum = sum all cs_seq[old_reader_group]
            // No readers started while summing reader_cnt.
            // approx_active_cnt reflects the number of
            // readers that started before MB_A.
            if cur_seq_sum == prev_seq_sum
                return
        }
        // There are preexisting readers.
        sleep(10ms)
    }
}
Preemptible RCU (preempt-rcu-2007) [13,14,15]
===============
+ readers may be preempted (but cannot block)
+ moderately low overhead reader_lock/unlock
(a fair number of simple instructions)
+ all cpus detect GP in parallel
+ moderate overhead of GP detection (it only observes,
never forces, context switches, etc)
- complicated/cumbersome GP detection
- requires regular sampling timer/scheduling tick
- GP detection is held back if threads are allotted
larger time quanta that they spend in the kernel
Algorithm & Data
----------------
See [13] for details.
End
===
An extensive overview of available RCU algorithms has
been given. We now know what is out there.
An upcoming email will (succinctly ;-)) cover which
combination of algorithms I would implement.
Adam
References
==========
[1] User-level implementations of read-copy update,
2012, appendix
http://www.rdrop.com/users/paulmck/RCU/urcu-supp-accepted.2011.08.30a.pdf
[2] Exploiting deferred destruction - an analysis
of read-copy-update techniques,
Section 4.4.1
2004, McKenney, Phd thesis
http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf
[3] Hierarchical rcu,
2008, McKenney
http://lwn.net/Articles/305782/
[4] [RFC&PATCH] Alternative RCU implementation,
2004, Houston
https://lkml.org/lkml/2004/8/30/87
[5] Read-copy-update for opensolaris,
2010, Podzimek
https://andrej.podzimek.org/thesis.pdf
[6] (podzimek-rcu) implementation file "rcu.patch"
http://d3s.mff.cuni.cz/projects/operating_systems/rcu/rcu.patch
[7] #define NEW_CPU_MSTATE in OpenSolaris
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/msacct.c#152
[8] Sleepable read-copy-update,
2008, McKenney
http://www.rdrop.com/users/paulmck/RCU/srcu.2007.01.14a.pdf
[9] Using Promela and Spin to verify parallel algorithms,
2007, McKenney
http://lwn.net/Articles/243851/
[10] QRCU with lockless fastpath
2007, McKenney
http://lwn.net/Articles/223752/
[11] linux/kernel/srcu.c in Linux 3.5-rc2,
2012
http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/kernel/srcu.c?v=linux-3.5-rc2-ccs-1.8.3
[12] [RFC PATCH 5/5 single-thread-version] implement
per-domain single-thread state machine,
2012, Lai
https://lkml.org/lkml/2012/3/6/586
[13] The design of preemptible read-copy update,
2007, McKenney
http://lwn.net/Articles/253651/
[14] [PATCH RFC 3/9] RCU: Preemptible RCU,
2007, McKenney
https://lkml.org/lkml/2007/9/10/216
[15] Extending rcu for realtime and embedded workloads,
2006, McKenney et al
http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
[16] [PATCH RFC -tip 4/4] Merge preemptable-RCU
functionality into hierarchical RCU,
2009, McKenney
https://lkml.org/lkml/2009/7/23/303
[17] linux/kernel/rcutree_plugin.h in Linux 3.5-rc2
2012
tomoyo.sourceforge.jp/cgi-bin/lxr/source/kernel/rcutree_plugin.h?v=linux-3.5-rc2-ccs-1.8.3
[18] Memory barriers: a hardware view for software hackers,
2010, McKenney
http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf
[19] A low-overhead coherence solution for multiprocessors
with private cache memories,
1984, Papamarcos; MESI protocol
http://dl.acm.org/citation.cfm?id=808204
[20] What every programmer should know about memory,
2007, Drepper
http://www.akkadia.org/drepper/cpumemory.pdf