On Fri, Jun 03, 2011 at 10:54:19AM +0300, Sasha Levin wrote:
> On Fri, 2011-06-03 at 09:34 +0200, Ingo Molnar wrote:
> > * Sasha Levin <[email protected]> wrote:
> > 
> > > > with no apparent progress being made.
> > > 
> > > Since it's something that worked in 2.6.37, I've looked into it to 
> > > find what might have caused this issue.
> > > 
> > > I've bisected guest kernels and found that the problem starts with:
> > > 
> > > a26ac2455ffcf3be5c6ef92bc6df7182700f2114 is the first bad commit
> > > commit a26ac2455ffcf3be5c6ef92bc6df7182700f2114
> > > Author: Paul E. McKenney <[email protected]>
> > > Date:   Wed Jan 12 14:10:23 2011 -0800
> > > 
> > >     rcu: move TREE_RCU from softirq to kthread
> > > 
> > > Ingo, could you confirm that the problem goes away for you when you 
> > > use an earlier commit?
> > 
> > testing will have to wait, but there's a recent upstream fix:
> > 
> >   d72bce0e67e8: rcu: Cure load woes
> > 
> > That *might* perhaps address this problem too.
> > 
> I've re-tested with Linus's current git, the problem is still there.
> 
> > If not then this appears to be some sort of RCU related livelock with 
> > brutally overcommitted vcpus. On native this would show up too, in a 
> > less drastic form, as a spurious bootup delay.
> 
> I don't think it was overcommitted by *that* much. With that commit it
> usually hangs at 20-40 vcpus, while without it I can go up to 255.

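Here is an untested diagnostic patch.  It assumes that your system
has only a few CPUs (maybe 8-16) and that timers are still running.
Every 10 seconds it checks the current grace period and, if that
grace period started at least 10 seconds ago, dumps the per-CPU
RCU kthread state (status, loop count, has-work flag) for each
online CPU.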

To activate it, call rcu_diag_timer_start() from process context.
To stop it, call rcu_diag_timer_stop(), also from process context.
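
For anyone reading the handler in the patch below, the "has this grace
period been running too long" test relies on a wraparound-safe
comparison of jiffies values.  Here is a minimal userspace sketch of
that check.  The ULONG_CMP_GE() definition is the one from
include/linux/rcupdate.h; everything else (gp_overdue(), DIAG_PERIOD,
the HZ = 1000 assumption, and the example values) is a hypothetical
stand-in for illustration, not part of the patch.

```c
#include <assert.h>
#include <limits.h>

/* Same definition as in the kernel's include/linux/rcupdate.h. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

/* Hypothetical stand-in for RCU_DIAG_TIMER_PERIOD (10 * HZ), assuming HZ = 1000. */
#define DIAG_PERIOD (10UL * 1000UL)

/*
 * Returns nonzero once "now" is at least DIAG_PERIOD ticks past gp_start.
 * "now" plays the role of jiffies, "gp_start" that of rcu_sched_state.gp_start.
 */
int gp_overdue(unsigned long now, unsigned long gp_start)
{
	return ULONG_CMP_GE(now, gp_start + DIAG_PERIOD);
}

/* A few checks, including a tick counter that wraps between start and now. */
void gp_overdue_examples(void)
{
	unsigned long near_wrap = ULONG_MAX - 5000UL;

	assert(!gp_overdue(1000UL, 1000UL));                /* just started */
	assert(!gp_overdue(1000UL + 9999UL, 1000UL));       /* one tick short */
	assert(gp_overdue(1000UL + 10000UL, 1000UL));       /* exactly at deadline */
	assert(!gp_overdue(near_wrap + 9000UL, near_wrap)); /* wrapped, 9000 ticks elapsed */
	assert(gp_overdue(near_wrap + 10000UL, near_wrap)); /* wrapped, deadline reached */
}
```

The unsigned subtraction inside the macro is what keeps the result
correct when the counter wraps: only the distance between the two
values matters, as long as it stays below ULONG_MAX / 2.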

Thoughts?

                                                        Thanx, Paul

------------------------------------------------------------------------

rcu: diagnostic check of kthread state
    
Not-signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 99f9aa7..489ea1b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -80,6 +80,8 @@ extern void call_rcu_sched(struct rcu_head *head,
 extern void synchronize_sched(void);
 extern void rcu_barrier_bh(void);
 extern void rcu_barrier_sched(void);
+extern void rcu_diag_timer_start(void);
+extern void rcu_diag_timer_stop(void);
 
 static inline void __rcu_read_lock_bh(void)
 {
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 89419ff..bb61574 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2423,3 +2423,48 @@ void __init rcu_init(void)
 }
 
 #include "rcutree_plugin.h"
+
+/* Diagnostic code for boot-time hangs observed in early 3.0 days. */
+
+static int rcu_diag_timer_must_stop;
+struct timer_list rcu_diag_timer;
+#define RCU_DIAG_TIMER_PERIOD  (10 * HZ)
+
+static void rcu_diag_timer_handler(unsigned long unused)
+{
+       int cpu;
+
+       if (rcu_diag_timer_must_stop)
+               return;
+
+       if (ULONG_CMP_GE(jiffies,
+                       rcu_sched_state.gp_start + RCU_DIAG_TIMER_PERIOD))
+               for_each_online_cpu(cpu) {
+                       printk(KERN_ALERT "rcu_diag: rcuc%d %u/%u/%d ",
+                              cpu,
+                              per_cpu(rcu_cpu_kthread_status, cpu),
+                              per_cpu(rcu_cpu_kthread_loops, cpu),
+                              per_cpu(rcu_cpu_has_work, cpu));
+                       sched_show_task(current);
+               }
+
+       if (rcu_diag_timer_must_stop)
+               return;
+       mod_timer(&rcu_diag_timer, RCU_DIAG_TIMER_PERIOD + jiffies);
+}
+
+void rcu_diag_timer_start(void)
+{
+       rcu_diag_timer_must_stop = 0;
+       setup_timer(&rcu_diag_timer,
+                   rcu_diag_timer_handler, (unsigned long) NULL);
+       mod_timer(&rcu_diag_timer, RCU_DIAG_TIMER_PERIOD + jiffies);
+}
+EXPORT_SYMBOL_GPL(rcu_diag_timer_start);
+
+void rcu_diag_timer_stop(void)
+{
+       rcu_diag_timer_must_stop = 1;
+       del_timer(&rcu_diag_timer);
+}
+EXPORT_SYMBOL_GPL(rcu_diag_timer_stop);