On 10/11/2012 03:50 AM, Paul E. McKenney wrote: > On Wed, Oct 10, 2012 at 01:53:04PM -0400, Mathieu Desnoyers wrote: >> * Mathieu Desnoyers ([email protected]) wrote: >>> * Paul E. McKenney ([email protected]) wrote: >>>> On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote: >>>>> * Lai Jiangshan ([email protected]) wrote: >>>>>> test code: >>>>>> ./tests/test_urcu_lfs 100 10 10 >>>>>> >>>>>> bug produce rate > 60% >>>>>> >>>>>> {{{ >>>>>> I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or >>>>>> "./tests/test_urcu_lfs 100 100 10" >>>>>> But I just test it about 5 times >>>>>> }}} >>>>>> >>>>>> 4cores*1threads: Intel(R) Core(TM) i5 CPU 760 >>>>>> RCU_MB (no time to test for other rcu type) >>>>>> test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a >>>>>> >>>>>> I didn't see any bug when "./tests/test_urcu_mb 10 100 10" >>>>>> >>>>>> Sorry, I tried, but I failed to find out the root cause currently. >>>>> >>>>> I think I managed to narrow down the issue: >>>>> >>>>> 1) the master branch does not reproduce it, but commit >>>>> 768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of >>>>> the >>>>> time. >>>>> >>>>> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and >>>>> current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu >>>>> moving to wfcqueue. >>>>> >>>>> 3) the bug always arise, for me, at the end of the 10 seconds. >>>>> However, it might be simply due to the fact that most of the memory >>>>> get freed at the end of program execution. >>>>> >>>>> 4) I've been able to get a backtrace, and it looks like we have some >>>>> call_rcu callback-invokation threads still working while >>>>> call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free() >>>>> is nicely waiting for the next thread to stop, and during that time, >>>>> two callback-invokation threads are invoking callbacks (and one of >>>>> them triggers the segfault). >>>> >>>> Do any of the callbacks reference __thread variables from some other >>>> thread? If so, those threads must refrain from exiting until after >>>> such callbacks complete. >>> >>> The callback is a simple caa_container_of + free, usual stuff, nothing >>> fancy. >> >> Here is the fix: the bug was in call rcu. It is not required for master, >> because we fixed it while moving to wfcqueue. >> >> We were erroneously writing to the head field of the default >> call_rcu_data rather than tail. > > Ouch!!! I have no idea why that would have passed my testing. :-(
It's one of the reasons that I rewrite wfqueue and introduce delete_all() (Mathieu uses splice instead) to replace open code of wfqueue in urcu-call-rcu-impl.h. > >> I wonder if we should simply do a new release with call_rcu using >> wfcqueue and tell people to upgrade, or if we should somehow create a >> stable branch with this fix. >> >> Thoughts ? > > Under what conditions does this bug appear? It is necessary to not just > use call_rcu(), but also to explicitly call call_rcu_data_free(), right? > > My guess is that a stable branch would be good -- there will be other > bugs, after all. :-/ > > Thanx, Paul > >> Thanks, >> >> Mathieu >> >> --- >> diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h >> index 13b24ff..b205229 100644 >> --- a/urcu-call-rcu-impl.h >> +++ b/urcu-call-rcu-impl.h >> @@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp) >> /* Create default call rcu data if need be */ >> (void) get_default_call_rcu_data(); >> cbs_endprev = (struct cds_wfq_node **) >> - uatomic_xchg(&default_call_rcu_data, cbs_tail); >> - *cbs_endprev = cbs; >> + uatomic_xchg(&default_call_rcu_data->cbs.tail, >> + cbs_tail); >> + _CMM_STORE_SHARED(*cbs_endprev, cbs); >> uatomic_add(&default_call_rcu_data->qlen, >> uatomic_read(&crdp->qlen)); >> wake_call_rcu_thread(default_call_rcu_data); >> >> >>> >>> Thanks, >>> >>> Mathieu >>> >>>> >>>> Thanx, Paul >>>> >>>>> So I expect that commit >>>>> >>>>> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe >>>>> Author: Mathieu Desnoyers <[email protected]> >>>>> Date: Tue Sep 25 10:50:49 2012 -0500 >>>>> >>>>> call_rcu: use wfcqueue, eliminate false-sharing >>>>> >>>>> Eliminate false-sharing between call_rcu (enqueuer) and worker threads >>>>> on the queue head and tail. >>>>> I think the changelog of this commit is too short. Thanks, Lai _______________________________________________ lttng-dev mailing list [email protected] http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
