On Mon, Dec 17, 2018 at 03:15:42AM +0000, He, Bo wrote:
> To double-confirm that the issue does not reproduce, after the 90-hour run we
> tried adding only the enclosed patch on top of the easily-reproduced build;
> the issue did not reproduce in 63 hours over the whole weekend on 16 boards.
> So the current conclusion is that the debug patch strongly suppresses the
> RCU issue.
This is not a surprise.  (Please see the end of this email for a
replacement patch that won't suppress the bug.)

To see why this is not a surprise, let's take a closer look at your
patch, in light of the comment header for
swait_event_idle_timeout_exclusive():

 * Returns:
 * 0 if the @condition evaluated to %false after the @timeout elapsed,
 * 1 if the @condition evaluated to %true after the @timeout elapsed,
 * or the remaining jiffies (at least 1) if the @condition evaluated
 * to %true before the @timeout elapsed.

The situation we are seeing is that RCU_GP_FLAG_INIT is set, but the
rcu_preempt task does not wake up.  This corresponds to the second case
above, that is, a return value of 1.

Looking now at your patch, with comments interspersed below:

------------------------------------------------------------------------

From e8b583aa685b3b4f304f72398a80461bba09389c Mon Sep 17 00:00:00 2001
From: "he, bo" <bo...@intel.com>
Date: Sun, 9 Dec 2018 18:11:33 +0800
Subject: [PATCH] rcu: detect the preempt_rcu hang for triage jing's board

Change-Id: I2ffceec2ae4847867753609e45c99afc66956003
Tracked-On:
Signed-off-by: he, bo <bo...@intel.com>
---
 kernel/rcu/tree.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78c0cf2..d6de363 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2192,8 +2192,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	pid_t rcu_preempt_pid;
 
 	rcu_bind_gp_kthread();
+	if(!strcmp(rsp->name, "rcu_preempt")) {
+		rcu_preempt_pid = rsp->gp_kthread->pid;
+	}
+
 	for (;;) {
 
 		/* Handle grace-period start.
 */
@@ -2202,8 +2207,19 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gp_seq),
 					       TPS("reqwait"));
 		rsp->gp_state = RCU_GP_WAIT_GPS;
-		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-				     RCU_GP_FLAG_INIT);
+		if (current->pid != rcu_preempt_pid) {
+			swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+					RCU_GP_FLAG_INIT);
+		} else {
+			ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+					RCU_GP_FLAG_INIT, 2*HZ);
+
+			if(!ret) {

We get here if ret == 0.  Therefore, the above "if" statement needs to
instead be "if (ret == 1) {".  In addition, in order to get event traces
dumped, we also need:

			rcu_ftrace_dump(DUMP_ALL);

+				show_rcu_gp_kthreads();
+				panic("hung_task: blocked in rcu_gp_kthread init");
+			}
+		}
+
 		rsp->gp_state = RCU_GP_DONE_GPS;
 		/* Locking provides needed memory barrier. */
 		if (rcu_gp_init(rsp))
-- 
2.7.4

------------------------------------------------------------------------

So, again, please change the "if(!ret) {" to "if (ret == 1) {", and
please add "rcu_ftrace_dump(DUMP_ALL);" right after this "if" statement,
as shown above.  With that change, I bet that you will again see
failures.

> Compared with swait_event_idle_exclusive, which checks the condition
> 2 times, swait_event_idle_timeout_exclusive checks the condition
> 3 times.
>
> So today I will do another experiment with only this change:
> -		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> -				RCU_GP_FLAG_INIT);
> +		ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> +				RCU_GP_FLAG_INIT, MAX_SCHEDULE_TIMEOUT);
> +
>
> Can you get some clues from the experiment?

Again, please instead make the changes that I called out above, with the
replacement for your patch 0001 shown below.

							Thanx, Paul

PS.  I have been testing for quite some time, but am still unable to
     reproduce this.  So we must depend on you to reproduce it.
------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f7..86152af1a580 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2153,8 +2153,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	pid_t rcu_preempt_pid;
 
 	rcu_bind_gp_kthread();
+	if(!strcmp(rsp->name, "rcu_preempt")) {
+		rcu_preempt_pid = rsp->gp_kthread->pid;
+	}
+
 	for (;;) {
 
 		/* Handle grace-period start. */
@@ -2163,8 +2168,20 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gp_seq),
 					       TPS("reqwait"));
 		rsp->gp_state = RCU_GP_WAIT_GPS;
-		swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
-				     RCU_GP_FLAG_INIT);
+		if (current->pid != rcu_preempt_pid) {
+			swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+					RCU_GP_FLAG_INIT);
+		} else {
+			ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
+					RCU_GP_FLAG_INIT, 2*HZ);
+
+			if (ret == 1) {
+				rcu_ftrace_dump(DUMP_ALL);
+				show_rcu_gp_kthreads();
+				panic("hung_task: blocked in rcu_gp_kthread init");
+			}
+		}
+
 		rsp->gp_state = RCU_GP_DONE_GPS;
 		/* Locking provides needed memory barrier. */
 		if (rcu_gp_init(rsp))