Hi Andrej,

Thank you very much for taking the time to look at my comments regarding your algorithm.
I am afraid I still do not understand how the race can be prevented. I am
probably stuck and -- sorry to say this -- you are most likely the only one
who can get me unstuck ;-). I would be grateful if you had a look at this
email as well and pointed out my logical error.

On Sun, Jun 17, 2012 at 1:22 PM, Martin Decky <[email protected]> wrote:
> Hello Adam and all,
>
> please see below a reply from Andrej Podzimek commenting in much detail on
> your observations of his RCU implementation.
>
>
> M.D.
>
>
> -------- Original Message --------
> Subject: Re: Fwd: [HelenOS-devel] RCU algorithm review
> Date: Sun, 17 Jun 2012 01:08:27 +0200
> From: Andrej Podzimek <[email protected]>
> To: Martin Decky <[email protected]>
>
>> Race condition?
>> ---------------
>> Unless I am missing something the algorithm appears
>> to be suffering from a race condition between the
>> detector and reclaimer threads.
>>
>> Detector's loop is:
>>   sleep
>>   ++cur_gp
>>   signal reclaimers a GP ended
>>   wait for preexisting readers
>>
>> Instead of:
>>   sleep
>>   ++cur_gp
>>   wait for preexisting readers
>>   signal reclaimers a GP ended
>>
>> As a result, in the following sequence, callback B()
>> is invoked before the preexisting reader R exits:
>> (reclaimer)                 (detector)           (reader)
>>   CPU 1                       CPU 2                CPU 3
>> next_cbs += A
>> signal detector
>> wait for detector
>>                             (sleeps)
>>                             ++cur_gp
>>                             signal reclaimer
>>                             wait for readers
>>                                                  R lock()
>> (different thread)
>> next_cbs += B
>> (reclaimer wakes up)
>> exec cur_cbs
>> move next_cbs to cur_cbs
>> signal detector
>>                             ++cur_gp
>>                             signal reclaimer
>
>
> ^^^ The second "signal reclaimer" call is a wrong assumption, it cannot
> happen here. The detector will *not* announce a GP boundary while CPU 3 is
> still inside a read-side critical section that had started *before* the
> reclaimer on CPU 1 signalled the detector. Let's see when the detector can
> actually announce the GP boundary:
>
> A) CPU3 would have to announce that it had already observed the *last*
> ++cur_gp (by assigning the new value to its local copy of cur_gp). Well, but
> this did *not* happen yet -- there has been no R unlock() and read-side
> sections in interrupt handlers obey the nesting counter (just like any
> possible nested R (un)lock() calls in general). So CPU 3 will have a stale
> copy of the GP counter and the detector will have to wait for it.

Hmm, is that really the case? "R lock()" started after the detector finished
waiting for readers (the line "wait for readers" right above "R lock()") and
went to sleep waiting for more work to arrive (cv_wait(rcu_wanted_cond)).

I do not see the detector waiting for readers to finish between the
"++rcu_cb.rcu_gp_ctr" and signaling reclaimers.

Lines 4047-4062 of rcu_detector():

+	for (;;) {
+		while (!rcu_cb.rcu_gp_wanted) {
+			CALLB_CPR_SAFE_BEGIN(&cprinfo);
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
+			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
+		}
+
+		--rcu_cb.rcu_gp_wanted;
+		++rcu_cb.rcu_gp_ctr;
	.. "++cur_gp" in pseudocode
+		cv_broadcast(
+		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
	.. "signal reclaimer" Oops, we signaled the reclaimer before
	   waiting for readers to finish.
+
+		mutex_exit(&rcu_cb.rcu_gp_mutex);
+		rcu_wait_and_force();
+		mutex_enter(&rcu_cb.rcu_gp_mutex);
+	}
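Just to make the comparison concrete: the ordering that the pseudocode above
asks for ("wait for preexisting readers" before "signal reclaimers a GP
ended") would, in terms of the quoted lines, look roughly as follows. This is
only a sketch of the intent, not a tested change, and I am not claiming it is
sufficient on its own:

	for (;;) {
		while (!rcu_cb.rcu_gp_wanted) {
			CALLB_CPR_SAFE_BEGIN(&cprinfo);
			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
		}

		--rcu_cb.rcu_gp_wanted;
		++rcu_cb.rcu_gp_ctr;		/* "++cur_gp" */

		mutex_exit(&rcu_cb.rcu_gp_mutex);
		rcu_wait_and_force();		/* wait for preexisting readers */
		mutex_enter(&rcu_cb.rcu_gp_mutex);

		/* Only now announce the GP boundary to the reclaimers. */
		cv_broadcast(
		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
	}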
Let us look at the actual source code [6] and let me trace the steps of the
race in real code.

CPU_1 = reclaimer
CPU_2 = detector
CPU_3 = idle, eventually the reader

I will serialize the code; additional tabs help identify which cpu we are
running on. My notes are prefixed with "//", and state changes are in
"[ var = new_value ]". First, I present a shorter version, similar to the
pseudocode, but with real code. A full/verbose version follows afterwards if
you are interested.

Short
-----

[ Starting state:
  rcu_cb.rcu_gp_ctr = 1
  rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_mutex = unlocked
  rcu_cb.rcu_elapsed_cond[0..1] = obviously, NOT signaled
  rcu_cb.rcu_hiwater = 0, but does not really matter
  all cpu's:
    cpu.rcu_queue_sema = 0
    cpu.rcu_gp_ctr = 0, but does not matter (could be 1)
    cpu.curlist = empty
    cpu.nextlist = empty ]

(CPU_3 is in idle loop)

(CPU_2)
// rcu_detector() waits for work to arrive at line 4050:
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);

(CPU_1 reclaimer thread)
// rcu_reclaimer() waits for work at line 4142:
+				sema_p(&RCU_CPU(cp, rcu_queue_sema));

(CPU_1 ordinary thread in kernel)
rcu_call(A);
[ cpu1.nextlist = {A}
  cpu1.rcu_queue_sema = 1 ]

(CPU_1 switches back to reclaimer thread)
[ rcu_synchronize_common()'s gp_ctr = 2 == rcu_cb.rcu_gp_ctr + 1
  cpu1.rcu_queue_sema = 0
  rcu_cb.rcu_gp_wanted = 1 ]
// Signals rcu_cb.rcu_wanted_cond and waits for
// elapsed[even == gp_ctr & 1 == 2 & 1] at line 3815:
+	cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+	    &rcu_cb.rcu_gp_mutex);

(CPU_1 switches to another thread due to cv_wait)

(CPU_2 detector wakes up due to signal)
// rcu_cb.rcu_gp_wanted == 1 and rcu_wanted_cond
// signaled, so lines 4050-4059 of rcu_detector():
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
+			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
+		}
+
+		--rcu_cb.rcu_gp_wanted;
+		++rcu_cb.rcu_gp_ctr;
[ rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_ctr = 2 ]
+		cv_broadcast(
+		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
// .. signaling elapsed[even == 2 & 1], but reclaimer
// may not be immediately scheduled. It may take a while.
//
// CPU_2 now runs and completes rcu_wait_and_force().
// Notice that there are no readers. CPU_3 is idle and
// CPU_1 could be running eg in user mode. But even if
// there were readers, we do not care at this point.

(CPU_3 reader thread appears and runs)
rcu_read_lock();
// .. executes a MB, updates cpu3.rcu_gp_ctr
[ cpu3.rcu_gp_ctr = 2 == rcu_cb.rcu_gp_ctr
  cpu3's preemption is now disabled ]

(CPU_1 ordinary thread in kernel)
rcu_call(B);
// .. B() is added after CPU_3's reader starts, so we
// must wait for the reader to exit its critical section
// before invoking B().
[ cpu1.nextlist = {A, B} ]

(CPU_1 time slice elapsed, switch to reclaimer)
// Executes callbacks in cpu1.curlist=={} and moves
// cpu1.nextlist to cpu1.curlist.
[ cpu1.curlist = {A, B}
  cpu1.nextlist = {} ]
// New loop iteration: there is work to do, so skip
// sema_down(), signal the detector...
[ rcu_reclaimer() local gp_ctr = 2 == rcu_cb.rcu_gp_ctr
  rcu_cb.rcu_gp_wanted = 1 ]
// ...and wait for elapsed[odd == (gp_ctr & 1) == (3 & 1)]
// at line 3815 of rcu_synchronize_common(1):
[ rcu_synchronize_common()'s gp_ctr = 3 == rcu_cb.rcu_gp_ctr + 1 ]
+	cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+	    &rcu_cb.rcu_gp_mutex);

(CPU_2 detector wakes up due to signal)
// rcu_cb.rcu_gp_wanted == 1 and rcu_wanted_cond
// signaled, so lines 4050-4059 of rcu_detector():
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
+			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
+		}
+
+		--rcu_cb.rcu_gp_wanted;
+		++rcu_cb.rcu_gp_ctr;
[ rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_ctr = 3 ]
+		cv_broadcast(
+		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
// .. signaling elapsed[odd == 3 & 1], so reclaimer
// can run (btw, the detector releases the mutex on the
// next line, so reclaimer really can run ;-)).
//
// The rest of the detector is irrelevant.

(CPU_1 reclaimer wakes up due to signal)
// rcu_synchronize_common(1) receives the signal for
// elapsed[odd], returns to rcu_reclaimer() which
// proceeds to call the callbacks in cpu1.curlist=={A,B}.
// Invoking B() even though the reader is running.
// And that's a bug.

Verbose
-------

Now the verbose trace of the race.

[ Starting state:
  rcu_cb.rcu_gp_ctr = 1
  rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_mutex = unlocked
  rcu_cb.rcu_elapsed_cond[0..1] = obviously, NOT signaled
  rcu_cb.rcu_hiwater = 0, but does not really matter
  all cpu's:
    cpu.rcu_queue_sema = 0
    cpu.rcu_gp_ctr = 0, but does not matter (could be 1)
    cpu.curlist = empty
    cpu.nextlist = empty ]

(CPU_3 is in idle loop)
[ No change in state. ]

(CPU_2)
// Lines 4047-4050 of rcu_detector():
+	mutex_enter(&rcu_cb.rcu_gp_mutex);
+	rcu_cb.rcu_cpr_events = &cprinfo.cc_events;
+	rcu_cb.rcu_proc = ttoproc(curthread);
+	cv_broadcast(&rcu_cb.rcu_elapsed_cond[0]);
+
+	cmn_err(CE_CONT, "RCU: detector starting\n");
+	for (;;) {
+		while (!rcu_cb.rcu_gp_wanted) {
+			CALLB_CPR_SAFE_BEGIN(&cprinfo);
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
// Detector sleeps waiting for work
[ No change in state. rcu_cb.rcu_gp_mutex again unlocked. ]

(CPU_1 reclaimer thread)
// Lines 4130-4142 of rcu_reclaimer():
+	RCU_CPU(cp, rcu_gp_ctr) = rcu_cb.rcu_gp_ctr;
+
+	for (;;) {
+		if (!(RCU_CPU(cp, rcu_nextlist) || RCU_CPU(cp, rcu_curlist))) {
+			do {
+				if (RCU_CPU(cp, rcu_sync))
+					rcu_handle_sync(cp);
+				if (RCU_STOP == RCU_CPU(cp, rcu_thread_state))
+					goto reclaimer_terminate;
+
+				CALLB_CPR_SAFE_BEGIN(&cprinfo);
+				mutex_exit(&RCU_CPU(cp, rcu_cpr_mutex));
+				sema_p(&RCU_CPU(cp, rcu_queue_sema));
// Reclaimer sleeps waiting for work, ie sema to become
// positive.

(CPU_1 Switch to ordinary thread in kernel)
rcu_call(A());
// Lines 3346-3350 of rcu_call() invoke rcu_call_common(),
// lines 3873-3898:
+rcu_call_common(rcu_t *arg, rcu_weight_t weight)
+{
+	cpu_t *cp;
+	/* comment snipped */
+	membar_exit();
+	kpreempt_disable();
+	cp = CPU;
+	/* comment snipped */
+	if (rcu_enqueue_callback(cp, arg, &arg->rcu_next))
+		sema_v(&RCU_CPU(cp, rcu_queue_sema));
// First callback enqueued, so sema is up()ed.
// Reclaimer can run once preemption is enabled.
[ cpu1.rcu_queue_sema = 1
  cpu1.rcu_nextlist = {A}
  cpu1.rcu_nexttail = &A.next ]
// Lines 3899-3902 of rcu_call_common():
+	RCU_ADD(RCU_CPU(cp, rcu_weight), weight);
+	rcu_announce_qs(cp);
// ...executes a MB and updates cpu1.rcu_gp_ctr
[ cpu1.rcu_gp_ctr = 1 ]
+	kpreempt_enable();
+}

(CPU_1 Switch to reclaimer thread waiting for sema)
// Lines 4142-4184 of rcu_reclaimer():
+				sema_p(&RCU_CPU(cp, rcu_queue_sema));
// ...can proceed, since sema was up()ed to 1.
[ cpu1.rcu_queue_sema = 0, again ]
+				mutex_enter(&RCU_CPU(cp, rcu_cpr_mutex));
+				CALLB_CPR_SAFE_END(&cprinfo,
+				    &RCU_CPU(cp, rcu_cpr_mutex));
+
+			} while (!RCU_CPU(cp, rcu_nextlist));
+		} else {
+			/*comment snipped*/
+			// membar_enter();
+			membar_consumer();
+		}
+
+		gp_ctr = rcu_cb.rcu_gp_ctr;
+
+		mutex_enter(&rcu_cb.rcu_gp_mutex);
// ...rcu_cb.rcu_gp_mutex was unlocked so we can enter.
[ rcu_cb.rcu_gp_mutex = locked
  gp_ctr = 1 ]
+		if (rcu_cb.rcu_gp_ctr - gp_ctr >= 2) {
+			goto elapsed;
+		}
+		/* comment snipped */
+		/* CPR related while-loop removed */
+		if (RCU_CPU(cp, rcu_weight) > rcu_weight_max) {
+			++rcu_cb.rcu_hiwater;
+			cv_signal(&rcu_cb.rcu_hiwater_cond);
+			rcu_synchronize_common(1);
+			--rcu_cb.rcu_hiwater;
+		} else {
+			rcu_synchronize_common(1);
// Lines 3802-3816 of rcu_synchronize_common():
+rcu_synchronize_common(gp_t periods)
+{
+	gp_t gp_ctr = rcu_cb.rcu_gp_ctr + periods;
[ gp_ctr = 1 + 1 = 2 ]
+	/* asserts removed */
+	if (rcu_cb.rcu_gp_wanted < periods) {
+		rcu_cb.rcu_gp_wanted = periods;
+		cv_signal(&rcu_cb.rcu_wanted_cond);
// rcu_cb.rcu_wanted_cond signaled, detector gets to run
// once we release rcu_cb.rcu_gp_mutex.
[ rcu_cb.rcu_gp_wanted = 1 ]
+	}
+
+	while (RCU_GP_GT(gp_ctr, rcu_cb.rcu_gp_ctr))
+		cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+		    &rcu_cb.rcu_gp_mutex);
// Waiting on elapsed[even == gp_ctr & 1 == 2 & 1]
// because 2 == gp_ctr > rcu_cb.rcu_gp_ctr == 1.
[ rcu_cb.rcu_gp_mutex = unlocked; because waiting for CV. ]

(CPU_1 is eg in user mode, since reclaimer is sleeping)

(CPU_2 detector awakens)
// Lines 4050-4059 of rcu_detector():
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
+			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
+		}
+
+		--rcu_cb.rcu_gp_wanted;
+		++rcu_cb.rcu_gp_ctr;
[ rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_ctr = 2
  rcu_cb.rcu_gp_mutex = locked ]
+		cv_broadcast(
+		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
// Signaling elapsed[even == 2 & 1]
+
+		mutex_exit(&rcu_cb.rcu_gp_mutex);
[ rcu_cb.rcu_gp_mutex = unlocked ]
// Now reclaimer can run again (because elapsed[even]
// was signaled) once it is scheduled and reacquires
// the mutex, but that may take a while.
+		rcu_wait_and_force();
// Lines 4209-4306 of rcu_wait_and_force():
+rcu_wait_and_force()
+{
+	cpu_t *cp, *first;
+	cpu_t *list_first;
+	cpu_t **list_tail = &list_first;
+	/* comment */
+	mutex_enter(&cpu_lock);
+	first = CPU;
+	cp = first;
+	do {
+		if (RCU_GP_GT(RCU_CPU(cp, rcu_gp_ctr), rcu_cb.rcu_gp_ctr))
+			RCU_CPU(cp, rcu_gp_ctr) = rcu_cb.rcu_gp_ctr - 1;
+		if (
+		    rcu_cb.rcu_gp_ctr != RCU_CPU(cp, rcu_gp_ctr) &&
+		    cp->cpu_mstate == CMS_SYSTEM &&
+		    cp->cpu_thread != curthread
+		) {
+			RCU_CPU(cp, rcu_pswitch) = CPU_STATS(cp, sys.pswitch);
+			*list_tail = cp;
+			list_tail = &RCU_CPU(cp, rcu_next_cpu);
+		}
+	} while ((cp = cp->cpu_next_onln) != first);
+	*list_tail = NULL;
// CPU_1 was in user space, CPU_3 is idle and CPU_2 is us,
// so there are no/zero cpus to wait for.
+	/*..*/
+	if (rcu_cb.rcu_hiwater) {
+		/*..*/
+		membar_consumer();
+	} else {
+		/*..*/
+		mutex_exit(&cpu_lock);
+		mutex_enter(&rcu_cb.rcu_gp_mutex);
// .. was unlocked, so ok to enter.
[ rcu_cb.rcu_gp_mutex = locked ]
+		if (!rcu_cb.rcu_hiwater) {
+			clock_t timeleft = rcu_gp_max;
+			do {
+				timeleft = cv_reltimedwait(
+				    &rcu_cb.rcu_hiwater_cond,
+				    &rcu_cb.rcu_gp_mutex, timeleft,
+				    TR_MILLISEC);
// .. let us pretend CPU_1 does not schedule the reclaimer
// and CPU_3 remains idle while we sleep here (but even
// if that were not the case, hiwater might be positive
// and we would not even sleep here).
+			} while (!(-1 == timeleft || rcu_cb.rcu_hiwater));
+		}
+		mutex_exit(&rcu_cb.rcu_gp_mutex);
[ rcu_cb.rcu_gp_mutex = unlocked ]
+		mutex_enter(&cpu_lock);
+	}
+	/*..*/
+	for (cp = list_first; cp; cp = RCU_CPU(cp, rcu_next_cpu)) {
+		/*..*/
+		if (
+		    cp->cpu_next && cpu_is_online(cp) &&
+		    cp->cpu_mstate == CMS_SYSTEM &&
+		    rcu_cb.rcu_gp_ctr != RCU_CPU(cp, rcu_gp_ctr) &&
+		    RCU_CPU(cp, rcu_pswitch) == CPU_STATS(cp, sys.pswitch)
+		) {
+			affinity_set(cp->cpu_id);
+			affinity_clear();
+		}
+	}
// .. nothing to check, all cpus were idle or in user mode.
+	mutex_exit(&cpu_lock);
+}

// Return to rcu_detector(), lines 4047-4050:
+	for (;;) {
+		while (!rcu_cb.rcu_gp_wanted) {
+			CALLB_CPR_SAFE_BEGIN(&cprinfo);
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
// rcu_gp_wanted == 0 so we wait for more work.
[ rcu_cb.rcu_gp_mutex = unlocked, again ]

(CPU_3 from idle to reader thread)
// Lines 3325-3331 of rcu_read_lock():
+rcu_read_lock()
+{
+	cpu_t *cp;
+
+	kpreempt_disable();
+	cp = CPU;
+	rcu_announce_qs(cp);
// Executes a MB and updates local rcu_gp_ctr
[ cpu3's preemption disabled from now on
  cpu3.rcu_gp_ctr = 2 ]
// Stays in critical section accessing stuff.

(CPU_1 ordinary thread; reclaimer waiting to be scheduled)
// !! Notice the following callback B() should wait
// for reader in CPU_3 to complete.
rcu_call(B());
// ..queues the second callback to nextlist, rcu_announce_qs()
// does nothing, since cpu1.rcu_gp_ctr == rcu_cb.rcu_gp_ctr,
// but that's irrelevant.
[ cpu1.rcu_nextlist = {A, B}
  cpu1.rcu_nexttail = &B.next ]

(CPU_1 ordinary thread preempted; back to reclaimer)
// Lines 3813-3817 of rcu_synchronize_common(1):
+	while (RCU_GP_GT(gp_ctr, rcu_cb.rcu_gp_ctr))
+		cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+		    &rcu_cb.rcu_gp_mutex);
+}
// Reclaimer was waiting for elapsed[even], that was
// signaled by detector and 2 == gp_ctr == rcu_cb.rcu_gp_ctr,
// so let's continue (rcu_cb.rcu_gp_mutex is unlocked) and
// return back to rcu_reclaimer().
[ rcu_cb.rcu_gp_mutex = locked ]
// Lines 4186-4189 of rcu_reclaimer():
+elapsed:	mutex_exit(&rcu_cb.rcu_gp_mutex);
+
+		rcu_advance_callbacks(cp);
+	}
// rcu_advance_callbacks() executes cpu1.curlist (which
// is empty, so no-op) and moves cpu1.nextlist to curlist.
[ rcu_cb.rcu_gp_mutex = unlocked
  cpu1.rcu_curlist = {A, B}
  cpu1.rcu_nextlist = {} ]
// rcu_reclaimer() starts another for(;;) loop iteration.
// Lines 4132-4184:
+	for (;;) {
+		if (!(RCU_CPU(cp, rcu_nextlist) || RCU_CPU(cp, rcu_curlist))) {
// Branch skipped since curlist non-empty.
+		} else {
+			/*..*/
+			// membar_enter();
+			membar_consumer();
+		}
+
+		gp_ctr = rcu_cb.rcu_gp_ctr;
+
+		mutex_enter(&rcu_cb.rcu_gp_mutex);
[ gp_ctr = 2, since 2 == rcu_cb.rcu_gp_ctr
  rcu_cb.rcu_gp_mutex = locked ]
+		if (rcu_cb.rcu_gp_ctr - gp_ctr >= 2) {
+			goto elapsed;
+		}
+		/*..*/
+		/* CPR related while-loop removed */
+		if (RCU_CPU(cp, rcu_weight) > rcu_weight_max) {
+			++rcu_cb.rcu_hiwater;
+			cv_signal(&rcu_cb.rcu_hiwater_cond);
+			rcu_synchronize_common(1);
+			--rcu_cb.rcu_hiwater;
+		} else {
+			rcu_synchronize_common(1);
// Lines 3802-3816 of rcu_synchronize_common(1):
+rcu_synchronize_common(gp_t periods)
+{
+	gp_t gp_ctr = rcu_cb.rcu_gp_ctr + periods;
[ gp_ctr = 2 + 1 = 3 ]
+	/* asserts removed */
+	if (rcu_cb.rcu_gp_wanted < periods) {
+		rcu_cb.rcu_gp_wanted = periods;
+		cv_signal(&rcu_cb.rcu_wanted_cond);
// rcu_cb.rcu_wanted_cond signaled, detector gets to run
// once we release rcu_cb.rcu_gp_mutex.
[ rcu_cb.rcu_gp_wanted = 1 ]
+	}
+
+	while (RCU_GP_GT(gp_ctr, rcu_cb.rcu_gp_ctr))
+		cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+		    &rcu_cb.rcu_gp_mutex);
// Waiting on elapsed[odd == gp_ctr & 1 == 3 & 1]
// because 3 == gp_ctr > rcu_cb.rcu_gp_ctr == 2.
[ rcu_cb.rcu_gp_mutex = unlocked, because of cv_wait() ]

(CPU_2 detector receives the CV signal)
// rcu_detector() wakes up; lines 4050-4059:
+			cv_wait(&rcu_cb.rcu_wanted_cond, &rcu_cb.rcu_gp_mutex);
+			CALLB_CPR_SAFE_END(&cprinfo, &rcu_cb.rcu_gp_mutex);
+		}
+
+		--rcu_cb.rcu_gp_wanted;
+		++rcu_cb.rcu_gp_ctr;
[ rcu_cb.rcu_gp_wanted = 0
  rcu_cb.rcu_gp_ctr = 3 ]
+		cv_broadcast(
+		    &rcu_cb.rcu_elapsed_cond[rcu_cb.rcu_gp_ctr & 0x1]);
// Signal elapsed[odd == (gp_ctr & 1) == (3 & 1)]
+
+		mutex_exit(&rcu_cb.rcu_gp_mutex);
[ rcu_cb.rcu_gp_mutex = unlocked ]
// At this point we don't care what CPU_2 does anymore.

(CPU_1 wakes up reclaimer from cv_wait)
// Detector signaled elapsed[odd] that we have been waiting
// for and also released the mutex.
// Lines 3814-3817 of rcu_synchronize_common(1):
+	while (RCU_GP_GT(gp_ctr, rcu_cb.rcu_gp_ctr))
+		cv_wait(&rcu_cb.rcu_elapsed_cond[gp_ctr & 0x1],
+		    &rcu_cb.rcu_gp_mutex);
+}
// .. 3 == gp_ctr == rcu_cb.rcu_gp_ctr == 3, so loop exits.
[ rcu_cb.rcu_gp_mutex = locked ]
// Returns back to rcu_reclaimer(), lines 4185-4188:
+	}
+elapsed:	mutex_exit(&rcu_cb.rcu_gp_mutex);
+
[ rcu_cb.rcu_gp_mutex = unlocked ]
+		rcu_advance_callbacks(cp);
// .. executes cpu1.curlist == {A, B}, ie B() is invoked
// even though CPU_3's reader did not yet exit!!

> B) Of course, the interleaving of instructions cannot be perceived as
> something constant ... so let's assume, just for fun, that R lock() occurred
> *after* the last ++cur_gp, as observed by some other thread. Then R lock()
> must have set CPU 3's local copy of cur_gp to the last value and the
> detector will not wait for CPU 3. But again, this is not a problem, since
> then CPU 3 had already observed the changes that resulted in issuing
> callback B *before* starting the read-side section (and called the
> conditional barrier in R lock()).

No argument there.

> [..] It is therefore perfectly fine to execute
> callback B in parallel with CPU 3's read-side section.

Unfortunately, this is not the case in the race I presented.

>
> C) The detector will simply wait until CPU 3 either announces the counter
> observation at R unlock() or context-switches or gets idle or switches to
> userspace... If the detector waits for too long (more than a small N of
> lbolt ticks) or gets a high water mark announcement (from reclaimers), then
> it will try to forcibly context-switch itself to CPU 3 ... which will *not*
> succeed before the non-preemptible read-side section is over.

Agreed with (C). (However, we're dealing with (A).)

>
>
>> exec cur_cbs (including B!!)                     R still locked!!
>>                             wait for readers
>
> To sum up, maybe I am missing something, but I really cannot see a race
> condition here.
>
> If there was a race condition of this magnitude, the hash table benchmarks
> (combined with thorough tests using bit patterns) would probably crash after
> milliseconds. However, they have neither crashed nor observed any
> inconsistent data modifications so far. (Admittedly, they have only run for
> about two weeks on all the machines combined so far...) ;-).

You can increase the likelihood of demonstrating the race if you test with
the hiwater mark set (so that the detector does not sleep).
Moreover, implementation details of condition variables on OpenSolaris may
influence how often you encounter the race in practice.

If you want to see the race in action you could insert a rcu_call into
rcu_advance_callbacks() (perfectly valid due to enabled preemption and
interrupts). The following should work even on a uniprocessor and *without*
a high water mark. Something in the spirit of:

// Wait for the detector and reclaimers to reach
// initial position.
demonstrate_race = 0
sleep(1 second)

// Single normal rcu_call during the whole test.
demonstrate_race = 1
rcu_call(A)

rcu_reclaimer() {
	// ..
	if (1 == demonstrate_race) {
		rcu_read_lock();
		// Add B() at the right time.
		rcu_call(B)
		ASSERT(nextlist == {A, B})
		// Don't lock or add B next time.
		demonstrate_race = 2
	}
	rcu_advance_callbacks
	// ..
}

B() {
	panic("Whoops, the race is real")
}

If there is no race, the test will deadlock. Otherwise, it will panic.

> Anyway, it is *great* that someone spends time and effort on exploring the
> algorithm and searches for possible shortcomings. I have found a bug in the
> hash table implementation approximately 6 months ago. The bug only occurred
> under the most aggressive experimental optimization mode of the Intel
> compiler. One single volatile cast fixed the bug. Yet one year earlier, I
> had been convinced that no inconsistency of that kind could ever be
> observed... (Notes about this interesting issue can still be found in the
> comments.) So again, code reviews are really important and many thanks for
> this one.

It was my pleasure reading your implementation. I am not going to lie: it is
a *very* clever algorithm. Well done!

>
>> + low overhead reader_lock/unlock (1 MB per GP and cpu)
>
> The overhead might be higher than that, unfortunately, although I'm not sure
> about this. The MB should occur only when the global GP counter changes ...
> theoretically. But in practice, the CPU may issue the MB speculatively, just
> to make sure it can safely take the alternative branch at any time. Hard to
> guess whether this actually happens.
>
> Linux uses prediction hints (defined as macros replacing the if statements)
> that are somehow reflected in the machine code (on platforms that support
> them, e.g., SPARC has those probably taken / probably not taken bits that
> can be set in each branch instruction). Perhaps this could help. In such
> case the fast path could be marked as "probably taken", in an attempt to
> prevent the CPU from issuing the MB speculatively.
>
> But this is just a wild guess. Someone would have to define those special if
> statements (so far UTS only uses standard C if, no macro replacements
> thereof) and *benchmark* it. AFAIK, perhaps some alternative if statements
> are supported by GCC directly (and GCC is now the default compiler for
> Illumos).
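For what it is worth, such prediction hints are usually just thin wrappers
around GCC's __builtin_expect(). A rough illustration follows -- the
rcu_likely/rcu_unlikely names are made up, and whether this actually keeps
the CPU from issuing the MB speculatively would have to be benchmarked, as
you say:

#define rcu_likely(x)	__builtin_expect(!!(x), 1)
#define rcu_unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * The reader fast path ("local copy already matches the global GP
 * counter, no MB needed") could then be marked as probably taken, e.g.:
 *
 *	if (rcu_likely(local_gp_ctr == rcu_cb.rcu_gp_ctr))
 *		return;
 *	... otherwise take the slow path: MB + update the local copy ...
 */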
>
>> - GP latency depends on cache coherency
>
> There is a reasonable upper bound on GP duration as long as all read-side
> sections behave correctly (and they should -- we are in the f...ing kernel).
> So the bare fact that the average / expected / whatever latency is
> influenced by UMA/NUMA and/or cache coherency protocols is not that
> disastrous.
>
> In some cases, longer GP durations save a lot of context-switching effort by
> batched reclamation processing.
>

CC-latency definitely does not influence the correctness of the algorithm.
However, it does significantly influence its performance. While I suspect
(as you seem to as well ;-)) CC-latency will not be a problem in practice, I
have very little experience with CC behavior, so I thought I would point it
out.

>
>> - forces context switches of long running threads
>> without RCU readers
>
> True. But this does not occur too often (from a machine's perspective).
>
> It would be nice if some threads could be explicitly marked as "free of RCU
> readers". The idle thread is (implicitly) marked that way right now and so
> is all userspace processing. :-)
>
> Unfortunately, an attempt to set a "free of RCU readers" flag on each R
> unlock() would (as one can imagine) lead to ugly race conditions, so all
> ideas of this kind need to be considered very carefully.
>
> Long-running kernel threads would (of course) be interrupted by the
> detector, but *most* long-running threads run in userspace (and so the
> detector does not need to take them into account and interrupt their CPUs at
> all). Long-running threads performing computations are *rare* in the kernel.
> Some of them are related to ZFS, but they use the SYSDC scheduling class,
> which means that they can be preempted by any other threads (including
> userspace threads) quite often.
>
>> - reclaimers and detector serialized via a single mutex
>
> A mutex *and* a condition variable, which is slightly more scalable than a
> mutex alone. Furthermore, reclaimers process potentially long batches of
> callbacks with the mutex unlocked and the detector also keeps it unlocked
> most of the time. The detector prolongs grace period duration on purpose
> (when there is no high water announcement) by a small N of lbolt ticks,
> which reduces the frequency of competition for the mutex even further.
>
> However, it is definitely true that a better solution *must* be found for
> large NUMA systems. It would be nice to have one detector per NUMA node and
> a tree-based GP announcement protocol (skipping inactive nodes (of course!)
> and supporting multiple instances of the mechanism, thus keeping the GP
> detection as local as possible, not competing with unrelated threads from
> distant NUMA nodes).

I fully agree with you. I nevertheless wanted to at least note the usage of
a single mutex (I reviewed other RCU algorithms as well, so it would be
unfair not to bring it up).

>> wait_on_gp - GP num whose end we are waiting for to
>> dispatch callbacks in cur_cbs (but it is not used
>> that way; see discussion of race condition)
>
> It's important to understand this basic idea correctly. A callback does
> *not* wait for the "end" of a particular GP. It waits for (at least) *two*
> GP boundaries to elapse.

I am confused. I thought that a GP boundary is the moment one GP ends and
another begins. If that is so, what is the difference between waiting for
the next GP to end and waiting for two GP boundaries to occur? See graphic:

--*-----------|-----------|---time--->
  ^ rcu_call  ^           ^
              |           |
              `first and  `second GP boundary
  <---------->|
     next GP  ` end of next GP

> This is why reclaimers have two queues of callbacks
> -- one of them can be (and actually *is*) extended in parallel with the
> reclaimer's work, while the other one is waiting for one more GP boundary to
> occur and is not extended any longer. On each GP boundary, the queue with
> old callbacks is processed and swapped with the new queue atomically.
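(Just to check that I understand the two-queue scheme, this is how I picture
it -- the names below are mine, not the ones used in rcu.patch:)

typedef struct cb_node {
	struct cb_node	*next;
	void		(*func)(struct cb_node *);
} cb_node_t;

/* Per cpu: */
cb_node_t *curlist;	/* closed batch, waits for one more GP boundary */
cb_node_t *nextlist;	/* open batch, still extended by rcu_call() */

/* On every GP boundary announced to this reclaimer: */
void
advance(void)
{
	cb_node_t *cb, *next;

	for (cb = curlist; cb != NULL; cb = next) {
		next = cb->next;
		cb->func(cb);	/* entries here have seen two boundaries */
	}
	curlist = nextlist;	/* promote the open batch... */
	nextlist = NULL;	/* ...and start collecting a new one */
}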
> Threads external to the RCU mechanism always wait for two GP boundaries to
> occur -- they are (in general) not expected to have the capability to split
> their work into smaller per-GP-boundary batches the way reclaimers do.
>
>> The algorithm relies on cpu's cache-coherency protocol
>> to deliver (make visible) the updated cur_gp to readers.
>> The same applies in the other direction when loading
>> last_seen_gp in the detector.
>
> It's just a heuristic and no more than that.
>
> If the coherency protocol doesn't make the updated cur_gp visible to
> readers, they will not announce a GP observation and potentially prevent the
> detector from taking the fast path. If the coherency protocol doesn't make
> the updated copies of cur_gp visible to the detector, the detector will
> (again) be unable to take the fast path. However, this does not impact the
> correctness of the algorithm.
>
> Stressing the RCU mechanism extremely on a small NUMA system showed that the
> cache coherency protocol works fine and that the heuristic helps a *lot*.
> The number of context switches per GP caused by the detector thread seemed
> to have an exponential-like distribution with an expected value around 1 on
> a 24-CPU system.

Very interesting.

> But admittedly, this was related to the frequency of
> context switches in our benchmarking workload (which is rather high on
> purpose). Furthermore, it would be nice to measure this *precisely*, rather
> than just guessing from context switch counters (and from a crippled
> detector that starts on the same CPU all the time instead of rotating over
> all the CPUs evenly).

It would be great to have cache-coherence latency thoroughly benchmarked. I
am guessing it might not be that easy to do.

>> If I am not mistaken, the implementation [6] checks
>> for threads being idle or in user space with
>> CPU->cpu_mstate in OpenSolaris. However, OpenSolaris
>> does not surround changes of cpu_mstate with MBs (for
>> performance reasons [7]); therefore, reading its value
>> from the detector is racy (eg if a CS immediately
>> follows the change).
>
> It might seem prone to races with one CS immediately following an idle /
> userspace state. However, this is why the GP boundaries are announced rather
> than GP "endings". The race condition you mention will never occur twice in
> a sequence. Yet RCU callbacks (as well as external threads asking for GP
> detection) are always delayed until at least *two* GP boundaries are
> observed.

I do not think this race is related to the number of GP boundaries the
detector waits for. The problem is that you may exclude a cpu from checking
its rcu_gp_ctr because its mstate has not yet even propagated to the
detector thread. Therefore, the detector does not even check if a QS was
announced or not although there may be readers running after mstate's
change.

Note that OpenSolaris assumes that SPARC's writes are ordered, so it is not
going to be that simple to reproduce this race.

>
> ^^^ If I am mistaken here, an example of a pathological scenario would be
> very helpful. The experience with the non-blocking hash table taught me one
> important fact: Intuition is a good servant, but a poor master. :-) So none
> of my explanations can be taken for granted -- there can still be bugs in
> both my thoughts and the algorithm.

Thanks again.
Adam

[6] (podzimek-rcu) implementation file "rcu.patch"
    http://d3s.mff.cuni.cz/projects/operating_systems/rcu/rcu.patch

_______________________________________________
HelenOS-devel mailing list
[email protected]
http://lists.modry.cz/cgi-bin/listinfo/helenos-devel
