Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-28 Thread Paul E. McKenney
On Tue, Oct 28, 2014 at 09:54:28AM -0600, Kevin Fenzi wrote:
> Just FYI, this solves the orig issue for me as well. ;) 
> 
> Thanks for all the work in tracking it down... 
> 
> Tested-by: Kevin Fenzi 

And thank you for testing as well!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-28 Thread Kevin Fenzi
Just FYI, this solves the orig issue for me as well. ;) 

Thanks for all the work in tracking it down... 

Tested-by: Kevin Fenzi 

kevin






Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-28 Thread Yanko Kaneti
On Tue-10/28/14-2014 05:50, Paul E. McKenney wrote:
> On Tue, Oct 28, 2014 at 10:12:43AM +0200, Yanko Kaneti wrote:
> > On Mon-10/27/14-2014 10:45, Paul E. McKenney wrote:
> > > On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> > > > On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> > > > > Paul E. McKenney  wrote:
> > > > > 
> > > > > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> > > > > >>Looking at the dmesg, the early boot messages seem to be
> > > > > >> confused as to how many CPUs there are, e.g.,
> > > > > >> 
> > > > > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
> > > > > >> Nodes=1
> > > > > >> [0.00] Hierarchical RCU implementation.
> > > > > >> [0.00]  RCU debugfs-based tracing is enabled.
> > > > > >> [0.00]  RCU dyntick-idle grace-period acceleration is 
> > > > > >> enabled.
> > > > > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to 
> > > > > >> nr_cpu_ids=4.
> > > > > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
> > > > > >> nr_cpu_ids=4
> > > > > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> > > > > >> [0.00]  Offload RCU callbacks from all CPUs
> > > > > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> > > > > >> 
> > > > > >>but later shows 2:
> > > > > >> 
> > > > > >> [0.233703] x86: Booting SMP configuration:
> > > > > >> [0.236003]  node  #0, CPUs:  #1
> > > > > >> [0.255528] x86: Booted up 1 node, 2 CPUs
> > > > > >> 
> > > > > >>In any event, the E8400 is a 2 core CPU with no hyperthreading.
> > > > > >
> > > > > >Well, this might explain some of the difficulties.  If RCU decides 
> > > > > >to wait
> > > > > >on CPUs that don't exist, we will of course get a hang.  And 
> > > > > >rcu_barrier()
> > > > > >was definitely expecting four CPUs.
> > > > > >
> > > > > >So what happens if you boot with maxcpus=2?  (Or build with
> > > > > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> > > > > >I might have some ideas for a real fix.
> > > > > 
> > > > >   Booting with maxcpus=2 makes no difference (the dmesg output is
> > > > > the same).
> > > > > 
> > > > >   Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> > > > > dmesg has different CPU information at boot:
> > > > > 
> > > > > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> > > > > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > > > >  [...]
> > > > > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> > > > > nr_node_ids:1
> > > > >  [...]
> > > > > [0.00] Hierarchical RCU implementation.
> > > > > [0.00]  RCU debugfs-based tracing is enabled.
> > > > > [0.00]  RCU dyntick-idle grace-period acceleration is 
> > > > > enabled.
> > > > > [0.00] NR_IRQS:4352 nr_irqs:440 0
> > > > > [0.00]  Offload RCU callbacks from all CPUs
> > > > > [0.00]  Offload RCU callbacks from CPUs: 0-1.
> > > > 
> > > > Thank you -- this confirms my suspicions on the fix, though I must admit
> > > > to being surprised that maxcpus made no difference.
> > > 
> > > And here is an alleged fix, lightly tested at this end.  Does this patch
> > > help?
> > 
> > Tested this on top of rc2 (as found in Fedora, and failing without the 
> > patch)
> > with all my modprobe scenarios and it seems to have fixed it.
> 
> Very good!  May I apply your Tested-by?

Sure. Sorry, I didn't include this earlier.

Tested-by: Yanko Kaneti 
 
>   Thanx, Paul
> 
> > Thanks
> > -Yanko
> > 
> > 
> > >   Thanx, Paul
> > > 
> > > 
> > > 
> > > rcu: Make rcu_barrier() understand about missing rcuo kthreads
> > > 
> > > Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> > > avoids creating rcuo kthreads for CPUs that never come online.  This
> > > fixes a bug in many instances of firmware: Instead of lying about their
> > > age, these systems instead lie about the number of CPUs that they have.
> > > Before commit 35ce7f29a44a, this could result in huge numbers of useless
> > > rcuo kthreads being created.
> > > 
> > > It appears that experience indicates that I should have told the
> > > people suffering from this problem to fix their broken firmware, but
> > > I instead produced what turned out to be a partial fix.   The missing
> > > piece supplied by this commit makes sure that rcu_barrier() knows not to
> > > post callbacks for no-CBs CPUs that have not yet come online, because
> > > otherwise rcu_barrier() will hang on systems having firmware that lies
> > > about the number of CPUs.
> > > 
> > > It is tempting to simply have rcu_barrier() refuse to post a callback on
> > > any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
> 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-28 Thread Paul E. McKenney
On Tue, Oct 28, 2014 at 10:12:43AM +0200, Yanko Kaneti wrote:
> On Mon-10/27/14-2014 10:45, Paul E. McKenney wrote:
> > On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> > > On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> > > > Paul E. McKenney  wrote:
> > > > 
> > > > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> > > > >>  Looking at the dmesg, the early boot messages seem to be
> > > > >> confused as to how many CPUs there are, e.g.,
> > > > >> 
> > > > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
> > > > >> Nodes=1
> > > > >> [0.00] Hierarchical RCU implementation.
> > > > >> [0.00]  RCU debugfs-based tracing is enabled.
> > > > >> [0.00]  RCU dyntick-idle grace-period acceleration is 
> > > > >> enabled.
> > > > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to 
> > > > >> nr_cpu_ids=4.
> > > > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
> > > > >> nr_cpu_ids=4
> > > > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> > > > >> [0.00]  Offload RCU callbacks from all CPUs
> > > > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> > > > >> 
> > > > >>  but later shows 2:
> > > > >> 
> > > > >> [0.233703] x86: Booting SMP configuration:
> > > > >> [0.236003]  node  #0, CPUs:  #1
> > > > >> [0.255528] x86: Booted up 1 node, 2 CPUs
> > > > >> 
> > > > >>  In any event, the E8400 is a 2 core CPU with no hyperthreading.
> > > > >
> > > > >Well, this might explain some of the difficulties.  If RCU decides to 
> > > > >wait
> > > > >on CPUs that don't exist, we will of course get a hang.  And 
> > > > >rcu_barrier()
> > > > >was definitely expecting four CPUs.
> > > > >
> > > > >So what happens if you boot with maxcpus=2?  (Or build with
> > > > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> > > > >I might have some ideas for a real fix.
> > > > 
> > > > Booting with maxcpus=2 makes no difference (the dmesg output is
> > > > the same).
> > > > 
> > > > Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> > > > dmesg has different CPU information at boot:
> > > > 
> > > > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> > > > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > > >  [...]
> > > > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> > > > nr_node_ids:1
> > > >  [...]
> > > > [0.00] Hierarchical RCU implementation.
> > > > [0.00]  RCU debugfs-based tracing is enabled.
> > > > [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> > > > [0.00] NR_IRQS:4352 nr_irqs:440 0
> > > > [0.00]  Offload RCU callbacks from all CPUs
> > > > [0.00]  Offload RCU callbacks from CPUs: 0-1.
> > > 
> > > Thank you -- this confirms my suspicions on the fix, though I must admit
> > > to being surprised that maxcpus made no difference.
> > 
> > And here is an alleged fix, lightly tested at this end.  Does this patch
> > help?
> 
> Tested this on top of rc2 (as found in Fedora, and failing without the patch)
> with all my modprobe scenarios and it seems to have fixed it.

Very good!  May I apply your Tested-by?

Thanx, Paul

> Thanks
> -Yanko
> 
> 
> > Thanx, Paul
> > 
> > 
> > 
> > rcu: Make rcu_barrier() understand about missing rcuo kthreads
> > 
> > Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> > avoids creating rcuo kthreads for CPUs that never come online.  This
> > fixes a bug in many instances of firmware: Instead of lying about their
> > age, these systems instead lie about the number of CPUs that they have.
> > Before commit 35ce7f29a44a, this could result in huge numbers of useless
> > rcuo kthreads being created.
> > 
> > It appears that experience indicates that I should have told the
> > people suffering from this problem to fix their broken firmware, but
> > I instead produced what turned out to be a partial fix.   The missing
> > piece supplied by this commit makes sure that rcu_barrier() knows not to
> > post callbacks for no-CBs CPUs that have not yet come online, because
> > otherwise rcu_barrier() will hang on systems having firmware that lies
> > about the number of CPUs.
> > 
> > It is tempting to simply have rcu_barrier() refuse to post a callback on
> > any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
> > does not work because rcu_barrier() is required to wait for all pending
> > callbacks.  It is therefore required to wait even for those callbacks
> > that cannot possibly be invoked.  Even if doing so hangs the system.
> > 
> > Given that posting a callback to a no-CBs CPU that does not yet have an
> > rcuo kthread can hang rcu_barrier(), it is tempting to report an error
> > in this case.  Unfortunately, this will result in false positives at
> > boot time, when it is perfectly legal to post callbacks to the boot CPU
> > before the scheduler has started, in other words, before it is legal
> > to invoke rcu_barrier().

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-28 Thread Yanko Kaneti
On Mon-10/27/14-2014 10:45, Paul E. McKenney wrote:
> On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> > > Paul E. McKenney  wrote:
> > > 
> > > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> > > >>Looking at the dmesg, the early boot messages seem to be
> > > >> confused as to how many CPUs there are, e.g.,
> > > >> 
> > > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
> > > >> Nodes=1
> > > >> [0.00] Hierarchical RCU implementation.
> > > >> [0.00]  RCU debugfs-based tracing is enabled.
> > > >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> > > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> > > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
> > > >> nr_cpu_ids=4
> > > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> > > >> [0.00]  Offload RCU callbacks from all CPUs
> > > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> > > >> 
> > > >>but later shows 2:
> > > >> 
> > > >> [0.233703] x86: Booting SMP configuration:
> > > >> [0.236003]  node  #0, CPUs:  #1
> > > >> [0.255528] x86: Booted up 1 node, 2 CPUs
> > > >> 
> > > >>In any event, the E8400 is a 2 core CPU with no hyperthreading.
> > > >
> > > >Well, this might explain some of the difficulties.  If RCU decides to 
> > > >wait
> > > >on CPUs that don't exist, we will of course get a hang.  And 
> > > >rcu_barrier()
> > > >was definitely expecting four CPUs.
> > > >
> > > >So what happens if you boot with maxcpus=2?  (Or build with
> > > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> > > >I might have some ideas for a real fix.
> > > 
> > >   Booting with maxcpus=2 makes no difference (the dmesg output is
> > > the same).
> > > 
> > >   Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> > > dmesg has different CPU information at boot:
> > > 
> > > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> > > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > >  [...]
> > > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> > > nr_node_ids:1
> > >  [...]
> > > [0.00] Hierarchical RCU implementation.
> > > [0.00]  RCU debugfs-based tracing is enabled.
> > > [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> > > [0.00] NR_IRQS:4352 nr_irqs:440 0
> > > [0.00]  Offload RCU callbacks from all CPUs
> > > [0.00]  Offload RCU callbacks from CPUs: 0-1.
> > 
> > Thank you -- this confirms my suspicions on the fix, though I must admit
> > to being surprised that maxcpus made no difference.
> 
> And here is an alleged fix, lightly tested at this end.  Does this patch
> help?

Tested this on top of rc2 (as found in Fedora, and failing without the patch)
with all my modprobe scenarios and it seems to have fixed it.

Thanks
-Yanko

 
>   Thanx, Paul
> 
> 
> 
> rcu: Make rcu_barrier() understand about missing rcuo kthreads
> 
> Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> avoids creating rcuo kthreads for CPUs that never come online.  This
> fixes a bug in many instances of firmware: Instead of lying about their
> age, these systems instead lie about the number of CPUs that they have.
> Before commit 35ce7f29a44a, this could result in huge numbers of useless
> rcuo kthreads being created.
> 
> It appears that experience indicates that I should have told the
> people suffering from this problem to fix their broken firmware, but
> I instead produced what turned out to be a partial fix.   The missing
> piece supplied by this commit makes sure that rcu_barrier() knows not to
> post callbacks for no-CBs CPUs that have not yet come online, because
> otherwise rcu_barrier() will hang on systems having firmware that lies
> about the number of CPUs.
> 
> It is tempting to simply have rcu_barrier() refuse to post a callback on
> any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
> does not work because rcu_barrier() is required to wait for all pending
> callbacks.  It is therefore required to wait even for those callbacks
> that cannot possibly be invoked.  Even if doing so hangs the system.
> 
> Given that posting a callback to a no-CBs CPU that does not yet have an
> rcuo kthread can hang rcu_barrier(), it is tempting to report an error
> in this case.  Unfortunately, this will result in false positives at
> boot time, when it is perfectly legal to post callbacks to the boot CPU
> before the scheduler has started, in other words, before it is legal
> to invoke rcu_barrier().
> 
> So this commit instead has rcu_barrier() avoid posting callbacks to
> CPUs having neither rcuo kthread nor pending callbacks, and has it
> complain bitterly if it finds CPUs having no rcuo kthread but some
> pending callbacks.


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Paul E. McKenney
On Mon, Oct 27, 2014 at 01:43:21PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> >> On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> >> > Paul E. McKenney  wrote:
> >> > 
> >> > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> >> > >>   Looking at the dmesg, the early boot messages seem to be
> >> > >> confused as to how many CPUs there are, e.g.,
> >> > >> 
> >> > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
> >> > >> Nodes=1
> >> > >> [0.00] Hierarchical RCU implementation.
> >> > >> [0.00]  RCU debugfs-based tracing is enabled.
> >> > >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> >> > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> >> > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
> >> > >> nr_cpu_ids=4
> >> > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> >> > >> [0.00]  Offload RCU callbacks from all CPUs
> >> > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> >> > >> 
> >> > >>   but later shows 2:
> >> > >> 
> >> > >> [0.233703] x86: Booting SMP configuration:
> >> > >> [0.236003]  node  #0, CPUs:  #1
> >> > >> [0.255528] x86: Booted up 1 node, 2 CPUs
> >> > >> 
> >> > >>   In any event, the E8400 is a 2 core CPU with no hyperthreading.
> >> > >
> >> > >Well, this might explain some of the difficulties.  If RCU decides to 
> >> > >wait
> >> > >on CPUs that don't exist, we will of course get a hang.  And 
> >> > >rcu_barrier()
> >> > >was definitely expecting four CPUs.
> >> > >
> >> > >So what happens if you boot with maxcpus=2?  (Or build with
> >> > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> >> > >I might have some ideas for a real fix.
> >> > 
> >> >  Booting with maxcpus=2 makes no difference (the dmesg output is
> >> > the same).
> >> > 
> >> >  Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> >> > dmesg has different CPU information at boot:
> >> > 
> >> > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> >> > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> >> >  [...]
> >> > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> >> > nr_node_ids:1
> >> >  [...]
> >> > [0.00] Hierarchical RCU implementation.
> >> > [0.00]   RCU debugfs-based tracing is enabled.
> >> > [0.00]   RCU dyntick-idle grace-period acceleration is enabled.
> >> > [0.00] NR_IRQS:4352 nr_irqs:440 0
> >> > [0.00]   Offload RCU callbacks from all CPUs
> >> > [0.00]   Offload RCU callbacks from CPUs: 0-1.
> >> 
> >> Thank you -- this confirms my suspicions on the fix, though I must admit
> >> to being surprised that maxcpus made no difference.
> >
> >And here is an alleged fix, lightly tested at this end.  Does this patch
> >help?
> 
>   This patch appears to make the problem go away; I've run about
> 10 iterations.  I applied this patch to the same -net tree I was using
> previously (-net as of Oct 22), with all other test patches removed.

So I finally produced a patch that helps!  It was bound to happen sooner
or later, I guess.  ;-)

>   FWIW, dmesg is unchanged, and still shows messages like:
> 
> [0.00]  Offload RCU callbacks from CPUs: 0-3.

Yep, at that point in boot, RCU has no way of knowing that the firmware
is lying to it about the number of CPUs.  ;-)

> Tested-by: Jay Vosburgh 

Thank you for your testing efforts!!!

Thanx, Paul

>   -J
> >
> >
> >
> >rcu: Make rcu_barrier() understand about missing rcuo kthreads
> >
> >Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> >avoids creating rcuo kthreads for CPUs that never come online.  This
> >fixes a bug in many instances of firmware: Instead of lying about their
> >age, these systems instead lie about the number of CPUs that they have.
> >Before commit 35ce7f29a44a, this could result in huge numbers of useless
> >rcuo kthreads being created.
> >
> >It appears that experience indicates that I should have told the
> >people suffering from this problem to fix their broken firmware, but
> >I instead produced what turned out to be a partial fix.   The missing
> >piece supplied by this commit makes sure that rcu_barrier() knows not to
> >post callbacks for no-CBs CPUs that have not yet come online, because
> >otherwise rcu_barrier() will hang on systems having firmware that lies
> >about the number of CPUs.
> >
> >It is tempting to simply have rcu_barrier() refuse to post a callback on
> >any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
> >does not work because rcu_barrier() is required to wait for all pending
> >callbacks.  It is therefore required to wait even for those callbacks
> >that cannot possibly be invoked.  Even if doing so hangs the system.

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
>> On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
>> > Paul E. McKenney  wrote:
>> > 
>> > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
>> > >> Looking at the dmesg, the early boot messages seem to be
>> > >> confused as to how many CPUs there are, e.g.,
>> > >> 
>> > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
>> > >> Nodes=1
>> > >> [0.00] Hierarchical RCU implementation.
>> > >> [0.00]  RCU debugfs-based tracing is enabled.
>> > >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
>> > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
>> > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
>> > >> nr_cpu_ids=4
>> > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
>> > >> [0.00]  Offload RCU callbacks from all CPUs
>> > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
>> > >> 
>> > >> but later shows 2:
>> > >> 
>> > >> [0.233703] x86: Booting SMP configuration:
>> > >> [0.236003]  node  #0, CPUs:  #1
>> > >> [0.255528] x86: Booted up 1 node, 2 CPUs
>> > >> 
>> > >> In any event, the E8400 is a 2 core CPU with no hyperthreading.
>> > >
>> > >Well, this might explain some of the difficulties.  If RCU decides to wait
>> > >on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
>> > >was definitely expecting four CPUs.
>> > >
>> > >So what happens if you boot with maxcpus=2?  (Or build with
>> > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
>> > >I might have some ideas for a real fix.
>> > 
>> >Booting with maxcpus=2 makes no difference (the dmesg output is
>> > the same).
>> > 
>> >Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
>> > dmesg has different CPU information at boot:
>> > 
>> > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
>> > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
>> >  [...]
>> > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
>> > nr_node_ids:1
>> >  [...]
>> > [0.00] Hierarchical RCU implementation.
>> > [0.00] RCU debugfs-based tracing is enabled.
>> > [0.00] RCU dyntick-idle grace-period acceleration is enabled.
>> > [0.00] NR_IRQS:4352 nr_irqs:440 0
>> > [0.00] Offload RCU callbacks from all CPUs
>> > [0.00] Offload RCU callbacks from CPUs: 0-1.
>> 
>> Thank you -- this confirms my suspicions on the fix, though I must admit
>> to being surprised that maxcpus made no difference.
>
>And here is an alleged fix, lightly tested at this end.  Does this patch
>help?

This patch appears to make the problem go away; I've run about
10 iterations.  I applied this patch to the same -net tree I was using
previously (-net as of Oct 22), with all other test patches removed.

FWIW, dmesg is unchanged, and still shows messages like:

[0.00]  Offload RCU callbacks from CPUs: 0-3.

Tested-by: Jay Vosburgh 

-J

>   Thanx, Paul
>
>
>
>rcu: Make rcu_barrier() understand about missing rcuo kthreads
>
>Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
>avoids creating rcuo kthreads for CPUs that never come online.  This
>fixes a bug in many instances of firmware: Instead of lying about their
>age, these systems instead lie about the number of CPUs that they have.
>Before commit 35ce7f29a44a, this could result in huge numbers of useless
>rcuo kthreads being created.
>
>It appears that experience indicates that I should have told the
>people suffering from this problem to fix their broken firmware, but
>I instead produced what turned out to be a partial fix.   The missing
>piece supplied by this commit makes sure that rcu_barrier() knows not to
>post callbacks for no-CBs CPUs that have not yet come online, because
>otherwise rcu_barrier() will hang on systems having firmware that lies
>about the number of CPUs.
>
>It is tempting to simply have rcu_barrier() refuse to post a callback on
>any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
>does not work because rcu_barrier() is required to wait for all pending
>callbacks.  It is therefore required to wait even for those callbacks
>that cannot possibly be invoked.  Even if doing so hangs the system.
>
>Given that posting a callback to a no-CBs CPU that does not yet have an
> >rcuo kthread can hang rcu_barrier(), it is tempting to report an error
>in this case.  Unfortunately, this will result in false positives at
>boot time, when it is perfectly legal to post callbacks to the boot CPU
>before the scheduler has started, in other words, before it is legal
>to invoke rcu_barrier().
>
> >So this commit instead has rcu_barrier() avoid posting callbacks to
> >CPUs having neither rcuo kthread nor pending callbacks, and has it
> >complain bitterly if it finds CPUs having no rcuo kthread but some
> >pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
> >kthread but pending callbacks, as noted earlier, it has no choice but
> >to hang indefinitely.

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> > Paul E. McKenney  wrote:
> > 
> > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> > >>  Looking at the dmesg, the early boot messages seem to be
> > >> confused as to how many CPUs there are, e.g.,
> > >> 
> > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> > >> [0.00] Hierarchical RCU implementation.
> > >> [0.00]  RCU debugfs-based tracing is enabled.
> > >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
> > >> nr_cpu_ids=4
> > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> > >> [0.00]  Offload RCU callbacks from all CPUs
> > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> > >> 
> > >>  but later shows 2:
> > >> 
> > >> [0.233703] x86: Booting SMP configuration:
> > >> [0.236003]  node  #0, CPUs:  #1
> > >> [0.255528] x86: Booted up 1 node, 2 CPUs
> > >> 
> > >>  In any event, the E8400 is a 2 core CPU with no hyperthreading.
> > >
> > >Well, this might explain some of the difficulties.  If RCU decides to wait
> > >on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
> > >was definitely expecting four CPUs.
> > >
> > >So what happens if you boot with maxcpus=2?  (Or build with
> > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> > >I might have some ideas for a real fix.
> > 
> > Booting with maxcpus=2 makes no difference (the dmesg output is
> > the same).
> > 
> > Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> > dmesg has different CPU information at boot:
> > 
> > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> >  [...]
> > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> > nr_node_ids:1
> >  [...]
> > [0.00] Hierarchical RCU implementation.
> > [0.00]  RCU debugfs-based tracing is enabled.
> > [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> > [0.00] NR_IRQS:4352 nr_irqs:440 0
> > [0.00]  Offload RCU callbacks from all CPUs
> > [0.00]  Offload RCU callbacks from CPUs: 0-1.
> 
> Thank you -- this confirms my suspicions on the fix, though I must admit
> to being surprised that maxcpus made no difference.

And here is an alleged fix, lightly tested at this end.  Does this patch
help?

Thanx, Paul



rcu: Make rcu_barrier() understand about missing rcuo kthreads

Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online.  This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix.   The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks.  It is therefore required to wait even for those callbacks
that cannot possibly be invoked.  Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case.  Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.

Reported-by: Yanko Kaneti 
Reported-by: Jay Vosburgh 
Reported-by: Meelis Roos 
Reported-by: Eric B Munson 
Signed-off-by: Paul E. McKenney 

diff --git 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Paul E. McKenney
On Mon, Oct 27, 2014 at 01:43:21PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
  On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
   Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
   
   On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
  Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,

[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
Nodes=1
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
nr_cpu_ids=4
[0.00] NR_IRQS:16640 nr_irqs:456 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-3.

  but later shows 2:

[0.233703] x86: Booting SMP configuration:
[0.236003]  node  #0, CPUs:  #1
[0.255528] x86: Booted up 1 node, 2 CPUs

  In any event, the E8400 is a 2 core CPU with no hyperthreading.
   
   Well, this might explain some of the difficulties.  If RCU decides to 
   wait
   on CPUs that don't exist, we will of course get a hang.  And 
   rcu_barrier()
   was definitely expecting four CPUs.
   
   So what happens if you boot with maxcpus=2?  (Or build with
   CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
   I might have some ideas for a real fix.
   
Booting with maxcpus=2 makes no difference (the dmesg output is
   the same).
   
Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
   dmesg has different CPU information at boot:
   
   [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
   [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[...]
   [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
   nr_node_ids:1
[...]
   [0.00] Hierarchical RCU implementation.
   [0.00]   RCU debugfs-based tracing is enabled.
   [0.00]   RCU dyntick-idle grace-period acceleration is enabled.
   [0.00] NR_IRQS:4352 nr_irqs:440 0
   [0.00]   Offload RCU callbacks from all CPUs
   [0.00]   Offload RCU callbacks from CPUs: 0-1.
  
  Thank you -- this confirms my suspicions on the fix, though I must admit
  to being surprised that maxcpus made no difference.
 
 And here is an alleged fix, lightly tested at this end.  Does this patch
 help?
 
   This patch appears to make the problem go away; I've run about
 10 iterations.  I applied this patch to the same -net tree I was using
 previously (-net as of Oct 22), with all other test patches removed.

So I finally produced a patch that helps!  It was bound to happen sooner
or later, I guess.  ;-)

   FWIW, dmesg is unchanged, and still shows messages like:
 
 [0.00]  Offload RCU callbacks from CPUs: 0-3.

Yep, at that point in boot, RCU has no way of knowing that the firmware
is lying to it about the number of CPUs.  ;-)

 Tested-by: Jay Vosburgh jay.vosbu...@canonical.com

Thank you for your testing efforts!!!

Thanx, Paul

   -J
 
 
 
 rcu: Make rcu_barrier() understand about missing rcuo kthreads
 
 Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
 avoids creating rcuo kthreads for CPUs that never come online.  This
 fixes a bug in many instances of firmware: Instead of lying about their
 age, these systems instead lie about the number of CPUs that they have.
 Before commit 35ce7f29a44a, this could result in huge numbers of useless
 rcuo kthreads being created.
 
 It appears that experience indicates that I should have told the
 people suffering from this problem to fix their broken firmware, but
 I instead produced what turned out to be a partial fix.   The missing
 piece supplied by this commit makes sure that rcu_barrier() knows not to
 post callbacks for no-CBs CPUs that have not yet come online, because
 otherwise rcu_barrier() will hang on systems having firmware that lies
 about the number of CPUs.
 
 It is tempting to simply have rcu_barrier() refuse to post a callback on
 any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
 does not work because rcu_barrier() is required to wait for all pending
 callbacks.  It is therefore required to wait even for those callbacks
 that cannot possibly be invoked.  Even if doing so hangs the system.
 
 Given that posting a callback to a no-CBs CPU that does not yet have an
 rcuo kthread can hang rcu_barrier(), it is tempting to report an error
 in this case.  Unfortunately, this will result in false positives at
 boot time, when it is perfectly legal to post callbacks to the boot CPU
 before the scheduler has started, in other words, before it is legal
 to invoke rcu_barrier().

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
> >> Paul E. McKenney  wrote:
> >> 
> >> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
> >> [...]
> >> >> Hmmm...  It sure looks like we have some callbacks stuck here.  I 
> >> >> clearly
> >> >> need to take a hard look at the sleep/wakeup code.
> >> >> 
> >> >> Thank you for running this!!!
> >> >
> >> >Could you please try the following patch?  If no joy, could you please
> >> >add rcu:rcu_nocb_wake to the list of ftrace events?
> >> 
> >>I tried the patch, it did not change the behavior.
> >> 
> >>I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
> >> and ran it again (with this patch and the first patch from earlier
> >> today); the trace output is a bit on the large side so I put it and the
> >> dmesg log at:
> >> 
> >> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
> >> 
> >> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt
> >
> >Thank you again!
> >
> >Very strange part of the trace.  The only sign of CPU 2 and 3 are:
> >
> >ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
> > cpu -1 remaining 0 # 0
> >ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
> > cpu -1 remaining 0 # 0
> >ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
> > cpu -1 remaining 0 # 1
> >ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> > OnlineNoCB cpu 0 remaining 1 # 1
> >ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
> > WakeNot
> >ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> > OnlineNoCB cpu 1 remaining 2 # 1
> >ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
> > WakeNot
> >ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> > OnlineNoCB cpu 2 remaining 3 # 1
> >ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
> > WakeNotPoll
> >ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> > OnlineNoCB cpu 3 remaining 4 # 1
> >ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
> > WakeNotPoll
> >ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
> > cpu -1 remaining 4 # 2
> >
> >The pair of WakeNotPoll trace entries says that at that point, RCU believed
> >that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/
> 
>   On the test system I'm using, CPUs 2 and 3 really do not exist;
> it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
> earlier message, but perhaps you missed it in the flurry.

Or forgot it.  Either way, thank you for reminding me.

>   Looking at the dmesg, the early boot messages seem to be
> confused as to how many CPUs there are, e.g.,
> 
> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> [0.00] Hierarchical RCU implementation.
> [0.00]  RCU debugfs-based tracing is enabled.
> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
> [0.00] NR_IRQS:16640 nr_irqs:456 0
> [0.00]  Offload RCU callbacks from all CPUs
> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> 
>   but later shows 2:
> 
> [0.233703] x86: Booting SMP configuration:
> [0.236003]  node  #0, CPUs:  #1
> [0.255528] x86: Booted up 1 node, 2 CPUs
> 
>   In any event, the E8400 is a 2 core CPU with no hyperthreading.

Well, this might explain some of the difficulties.  If RCU decides to wait
on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
was definitely expecting four CPUs.

So what happens if you boot with maxcpus=2?  (Or build with
CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
I might have some ideas for a real fix.

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Yanko Kaneti
On Fri-10/24/14-2014 14:49, Paul E. McKenney wrote:
> On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> > On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> > > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > > Well, if you are feeling aggressive, give the following patch a spin.
> > > > > I am doing sanity tests on it in the meantime.
> > > > 
> > > > Doesn't seem to make a difference here
> > > 
> > > OK, inspection isn't cutting it, so time for tracing.  Does the system
> > > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> > > the problem occurs, then dump the trace buffer after the problem occurs.
> > 
> > Sorry for being unresponsive here, but I know next to nothing about tracing
> > or most things about the kernel, so I have some catching up to do.
> > 
> > In the meantime some layman observations while I tried to find what exactly
> > triggers the problem.
> > - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
> > - libvirtd seems to be very active in using all sorts of kernel facilities
> >   that are modules on fedora so it seems to cause many simultaneous kworker 
> >   calls to modprobe
> > - there are 8 kworker/u16 from 0 to 7
> > - one of these kworkers always deadlocks, while there appear to be two
> >   kworker/u16:6 - the seventh
> 
> Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> 
> >   6 vs 8 as in 6 rcuos where before they were always 8
> > 
> > Just observations from someone who still doesn't know what the u16
> > kworkers are..
> 
> Could you please run the following diagnostic patch?  This will help
> me see if I have managed to miswire the rcuo kthreads.  It should
> print some information at task-hang time.

So here is the output with today's Linux tip and the diagnostic patch.
This is the case with just starting libvirtd in runlevel 1.
Also a snapshot of the kworker/u16 threads

6 ?S  0:00  \_ [kworker/u16:0]
  553 ?S  0:00  |   \_ [kworker/u16:0]
  554 ?D  0:00  |   \_ /sbin/modprobe -q -- bridge
   78 ?S  0:00  \_ [kworker/u16:1]
   92 ?S  0:00  \_ [kworker/u16:2]
   93 ?S  0:00  \_ [kworker/u16:3]
   94 ?S  0:00  \_ [kworker/u16:4]
   95 ?S  0:00  \_ [kworker/u16:5]
   96 ?D  0:00  \_ [kworker/u16:6]
  105 ?S  0:00  \_ [kworker/u16:7]
  108 ?S  0:00  \_ [kworker/u16:8]


INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #16
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:6   D 8800ca9ecec0 1155296  2 0x
Workqueue: netns cleanup_net
 880221fff9c8 0096 8800ca9ecec0 001d5f00
 880221d8 001d5f00 88022326 8800ca9ecec0
 82c44010 7fff 81ee3798 81ee3790
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x26c/0x410
 [] ? native_sched_clock+0x2a/0xa0
 [] ? mark_held_locks+0x7c/0xb0
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] ? trace_hardirqs_on_caller+0x15d/0x200
 [] wait_for_completion+0x10c/0x150
 [] ? wake_up_state+0x20/0x20
 [] _rcu_barrier+0x677/0xcd0
 [] rcu_barrier+0x15/0x20
 [] netdev_run_todo+0x6f/0x310
 [] ? rollback_registered_many+0x265/0x2e0
 [] rtnl_unlock+0xe/0x10
 [] default_device_exit_batch+0x156/0x180
 [] ? abort_exclusive_wait+0xb0/0xb0
 [] ops_exit_list.isra.1+0x53/0x60
 [] cleanup_net+0x100/0x1f0
 [] process_one_work+0x218/0x850
 [] ? process_one_work+0x17f/0x850
 [] ? worker_thread+0xe7/0x4a0
 [] worker_thread+0x6b/0x4a0
 [] ? process_one_work+0x850/0x850
 [] kthread+0x10b/0x130
 [] ? sched_clock+0x9/0x10
 [] ? kthread_create_on_node+0x250/0x250
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/96:
 #0:  ("%s""netns"){.+.+.+}, at: []
 #process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: []
 #process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [] cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: []
 #_rcu_barrier+0x75/0xcd0
rcu_show_nocb_setup(): rcu_sched nocb state:
  0: 8802267ced40 l:8802267ced40 n:8802269ced40 .G.
  1: 8802269ced40 l:8802267ced40 n:  (null) ...
  2: 880226bced40 l:880226bced40 n:880226dced40 .G.
  3: 880226dced40 l:880226bced40 n:  (null) N..
  4: 880226fced40 l:880226fced40 n:8802271ced40 .G.
  5: 8802271ced40 l:880226fced40 n:  (null) ...
  6: 8802273ced40 l:8802273ced40 n:8802275ced40 N..
  7: 8802275ced40 l:8802273ced40 n:  (null) N..
rcu_show_nocb_setup(): rcu_bh nocb state:
  0: 8802267ceac0 l:8802267ceac0 n:8802269ceac0 ...
  1: 8802269ceac0 l:8802267ceac0 n:  (null) ...
  2: 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 03:09:36PM +0300, Yanko Kaneti wrote:
> On Fri-10/24/14-2014 14:49, Paul E. McKenney wrote:
> > On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> > > On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> > > > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > > > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
> > > > > > Well, if you are feeling aggressive, give the following patch a 
> > > > > > spin.
> > > > > > I am doing sanity tests on it in the meantime.
> > > > > 
> > > > > Doesn't seem to make a difference here
> > > > 
> > > > OK, inspection isn't cutting it, so time for tracing.  Does the system
> > > > respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
> > > > before
> > > > the problem occurs, then dump the trace buffer after the problem occurs.
> > > 
> > > Sorry for being unresponsive here, but I know next to nothing about tracing
> > > or most things about the kernel, so I have some catching up to do.
> > > 
> > > In the meantime some layman observations while I tried to find what 
> > > exactly
> > > triggers the problem.
> > > - Even in runlevel 1 I can reliably trigger the problem by starting 
> > > libvirtd
> > > - libvirtd seems to be very active in using all sorts of kernel facilities
> > >   that are modules on fedora so it seems to cause many simultaneous 
> > > kworker 
> > >   calls to modprobe
> > > - there are 8 kworker/u16 from 0 to 7
> > > - one of these kworkers always deadlocks, while there appear to be two
> > >   kworker/u16:6 - the seventh
> > 
> > Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> > 
> > >   6 vs 8 as in 6 rcuos where before they were always 8
> > > 
> > > Just observations from someone who still doesn't know what the u16
> > > kworkers are..
> > 
> > Could you please run the following diagnostic patch?  This will help
> > me see if I have managed to miswire the rcuo kthreads.  It should
> > print some information at task-hang time.
> 
> So here is the output with today's Linux tip and the diagnostic patch.
> This is the case with just starting libvirtd in runlevel 1.

Thank you for testing this!

> Also a snapshot of the kworker/u16 threads
> 
> 6 ?S  0:00  \_ [kworker/u16:0]
>   553 ?S  0:00  |   \_ [kworker/u16:0]
>   554 ?D  0:00  |   \_ /sbin/modprobe -q -- bridge
>78 ?S  0:00  \_ [kworker/u16:1]
>92 ?S  0:00  \_ [kworker/u16:2]
>93 ?S  0:00  \_ [kworker/u16:3]
>94 ?S  0:00  \_ [kworker/u16:4]
>95 ?S  0:00  \_ [kworker/u16:5]
>96 ?D  0:00  \_ [kworker/u16:6]
>   105 ?S  0:00  \_ [kworker/u16:7]
>   108 ?S  0:00  \_ [kworker/u16:8]

You had six CPUs, IIRC, so the last two kworker/u16 kthreads are surplus
to requirements.  Not sure if they are causing any trouble, though.

> INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
>   Not tainted 3.18.0-rc1+ #16
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u16:6   D 8800ca9ecec0 1155296  2 0x
> Workqueue: netns cleanup_net
>  880221fff9c8 0096 8800ca9ecec0 001d5f00
>  880221d8 001d5f00 88022326 8800ca9ecec0
>  82c44010 7fff 81ee3798 81ee3790
> Call Trace:
>  [] schedule+0x29/0x70
>  [] schedule_timeout+0x26c/0x410
>  [] ? native_sched_clock+0x2a/0xa0
>  [] ? mark_held_locks+0x7c/0xb0
>  [] ? _raw_spin_unlock_irq+0x30/0x50
>  [] ? trace_hardirqs_on_caller+0x15d/0x200
>  [] wait_for_completion+0x10c/0x150
>  [] ? wake_up_state+0x20/0x20
>  [] _rcu_barrier+0x677/0xcd0
>  [] rcu_barrier+0x15/0x20
>  [] netdev_run_todo+0x6f/0x310
>  [] ? rollback_registered_many+0x265/0x2e0
>  [] rtnl_unlock+0xe/0x10
>  [] default_device_exit_batch+0x156/0x180
>  [] ? abort_exclusive_wait+0xb0/0xb0
>  [] ops_exit_list.isra.1+0x53/0x60
>  [] cleanup_net+0x100/0x1f0
>  [] process_one_work+0x218/0x850
>  [] ? process_one_work+0x17f/0x850
>  [] ? worker_thread+0xe7/0x4a0
>  [] worker_thread+0x6b/0x4a0
>  [] ? process_one_work+0x850/0x850
>  [] kthread+0x10b/0x130
>  [] ? sched_clock+0x9/0x10
>  [] ? kthread_create_on_node+0x250/0x250
>  [] ret_from_fork+0x7c/0xb0
>  [] ? kthread_create_on_node+0x250/0x250
> 4 locks held by kworker/u16:6/96:
>  #0:  ("%s""netns"){.+.+.+}, at: []
>  #process_one_work+0x17f/0x850
>  #1:  (net_cleanup_work){+.+.+.}, at: []
>  #process_one_work+0x17f/0x850
>  #2:  (net_mutex){+.+.+.}, at: [] cleanup_net+0x8c/0x1f0
>  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: []
>  #_rcu_barrier+0x75/0xcd0
> rcu_show_nocb_setup(): rcu_sched nocb state:
>   0: 8802267ced40 l:8802267ced40 n:8802269ced40 .G.
>   1: 8802269ced40 l:8802267ced40 n:  (null) ...
>   2: 880226bced40 l:880226bced40 n:880226dced40 .G.
>   3: 880226dced40 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
>> 
>> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
>> [...]
>> >> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
>> >> need to take a hard look at the sleep/wakeup code.
>> >> 
>> >> Thank you for running this!!!
>> >
>> >Could you please try the following patch?  If no joy, could you please
>> >add rcu:rcu_nocb_wake to the list of ftrace events?
>> 
>>  I tried the patch, it did not change the behavior.
>> 
>>  I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
>> and ran it again (with this patch and the first patch from earlier
>> today); the trace output is a bit on the large side so I put it and the
>> dmesg log at:
>> 
>> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
>> 
>> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt
>
>Thank you again!
>
>Very strange part of the trace.  The only sign of CPU 2 and 3 are:
>
>ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
> cpu -1 remaining 0 # 0
>ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
> cpu -1 remaining 0 # 0
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
> cpu -1 remaining 0 # 1
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 0 remaining 1 # 1
>ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
> WakeNot
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 1 remaining 2 # 1
>ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
> WakeNot
>ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 2 remaining 3 # 1
>ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
> WakeNotPoll
>ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 3 remaining 4 # 1
>ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
> WakeNotPoll
>ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
> cpu -1 remaining 4 # 2
>
>The pair of WakeNotPoll trace entries says that at that point, RCU believed
>that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

On the test system I'm using, CPUs 2 and 3 really do not exist;
it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
earlier message, but perhaps you missed it in the flurry.

Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,

[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[0.00] NR_IRQS:16640 nr_irqs:456 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-3.

but later shows 2:

[0.233703] x86: Booting SMP configuration:
[0.236003]  node  #0, CPUs:  #1
[0.255528] x86: Booted up 1 node, 2 CPUs

In any event, the E8400 is a 2 core CPU with no hyperthreading.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
>>  Looking at the dmesg, the early boot messages seem to be
>> confused as to how many CPUs there are, e.g.,
>> 
>> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
>> [0.00] Hierarchical RCU implementation.
>> [0.00]  RCU debugfs-based tracing is enabled.
>> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
>> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
>> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
>> [0.00] NR_IRQS:16640 nr_irqs:456 0
>> [0.00]  Offload RCU callbacks from all CPUs
>> [0.00]  Offload RCU callbacks from CPUs: 0-3.
>> 
>>  but later shows 2:
>> 
>> [0.233703] x86: Booting SMP configuration:
>> [0.236003]  node  #0, CPUs:  #1
>> [0.255528] x86: Booted up 1 node, 2 CPUs
>> 
>>  In any event, the E8400 is a 2 core CPU with no hyperthreading.
>
>Well, this might explain some of the difficulties.  If RCU decides to wait
>on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
>was definitely expecting four CPUs.
>
>So what happens if you boot with maxcpus=2?  (Or build with
>CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
>I might have some ideas for a real fix.

Booting with maxcpus=2 makes no difference (the dmesg output is
the same).

Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
dmesg has different CPU information at boot:

[0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
[0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
 [...]
[0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
nr_node_ids:1
 [...]
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00] NR_IRQS:4352 nr_irqs:440 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-1.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
> >>Looking at the dmesg, the early boot messages seem to be
> >> confused as to how many CPUs there are, e.g.,
> >> 
> >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> >> [0.00] Hierarchical RCU implementation.
> >> [0.00]  RCU debugfs-based tracing is enabled.
> >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
> >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
> >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
> >> [0.00] NR_IRQS:16640 nr_irqs:456 0
> >> [0.00]  Offload RCU callbacks from all CPUs
> >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
> >> 
> >>but later shows 2:
> >> 
> >> [0.233703] x86: Booting SMP configuration:
> >> [0.236003]  node  #0, CPUs:  #1
> >> [0.255528] x86: Booted up 1 node, 2 CPUs
> >> 
> >>In any event, the E8400 is a 2 core CPU with no hyperthreading.
> >
> >Well, this might explain some of the difficulties.  If RCU decides to wait
> >on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
> >was definitely expecting four CPUs.
> >
> >So what happens if you boot with maxcpus=2?  (Or build with
> >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
> >I might have some ideas for a real fix.
> 
>   Booting with maxcpus=2 makes no difference (the dmesg output is
> the same).
> 
>   Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
> dmesg has different CPU information at boot:
> 
> [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
> [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
>  [...]
> [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
> nr_node_ids:1
>  [...]
> [0.00] Hierarchical RCU implementation.
> [0.00]RCU debugfs-based tracing is enabled.
> [0.00]RCU dyntick-idle grace-period acceleration is enabled.
> [0.00] NR_IRQS:4352 nr_irqs:440 0
> [0.00]Offload RCU callbacks from all CPUs
> [0.00]Offload RCU callbacks from CPUs: 0-1.

Thank you -- this confirms my suspicions on the fix, though I must admit
to being surprised that maxcpus made no difference.

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
 Looking at the dmesg, the early boot messages seem to be
  confused as to how many CPUs there are, e.g.,
  
  [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
  [0.00] Hierarchical RCU implementation.
  [0.00]  RCU debugfs-based tracing is enabled.
  [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
  [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
  [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
  [0.00] NR_IRQS:16640 nr_irqs:456 0
  [0.00]  Offload RCU callbacks from all CPUs
  [0.00]  Offload RCU callbacks from CPUs: 0-3.
  
 but later shows 2:
  
  [0.233703] x86: Booting SMP configuration:
  [0.236003]  node  #0, CPUs:  #1
  [0.255528] x86: Booted up 1 node, 2 CPUs
  
 In any event, the E8400 is a 2 core CPU with no hyperthreading.
 
 Well, this might explain some of the difficulties.  If RCU decides to wait
 on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
 was definitely expecting four CPUs.
 
 So what happens if you boot with maxcpus=2?  (Or build with
 CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
 I might have some ideas for a real fix.
 
   Booting with maxcpus=2 makes no difference (the dmesg output is
 the same).
 
   Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
 dmesg has different CPU information at boot:
 
 [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
 [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
  [...]
 [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
 nr_node_ids:1
  [...]
 [0.00] Hierarchical RCU implementation.
 [0.00]RCU debugfs-based tracing is enabled.
 [0.00]RCU dyntick-idle grace-period acceleration is enabled.
 [0.00] NR_IRQS:4352 nr_irqs:440 0
 [0.00]Offload RCU callbacks from all CPUs
 [0.00]Offload RCU callbacks from CPUs: 0-1.

Thank you -- this confirms my suspicions on the fix, though I must admit
to being surprised that maxcpus made no difference.

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
  Looking at the dmesg, the early boot messages seem to be
 confused as to how many CPUs there are, e.g.,
 
 [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
 [0.00] Hierarchical RCU implementation.
 [0.00]  RCU debugfs-based tracing is enabled.
 [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
 [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
 [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
 [0.00] NR_IRQS:16640 nr_irqs:456 0
 [0.00]  Offload RCU callbacks from all CPUs
 [0.00]  Offload RCU callbacks from CPUs: 0-3.
 
  but later shows 2:
 
 [0.233703] x86: Booting SMP configuration:
 [0.236003]  node  #0, CPUs:  #1
 [0.255528] x86: Booted up 1 node, 2 CPUs
 
  In any event, the E8400 is a 2 core CPU with no hyperthreading.

Well, this might explain some of the difficulties.  If RCU decides to wait
on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
was definitely expecting four CPUs.

So what happens if you boot with maxcpus=2?  (Or build with
CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
I might have some ideas for a real fix.

Booting with maxcpus=2 makes no difference (the dmesg output is
the same).

Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
dmesg has different CPU information at boot:

[0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
[0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
 [...]
[0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
nr_node_ids:1
 [...]
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00] NR_IRQS:4352 nr_irqs:440 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-1.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
 [...]
  Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
  need to take a hard look at the sleep/wakeup code.
  
  Thank you for running this!!!
 
 Could you please try the following patch?  If no joy, could you please
 add rcu:rcu_nocb_wake to the list of ftrace events?
 
  I tried the patch, it did not change the behavior.
 
  I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
 and ran it again (with this patch and the first patch from earlier
 today); the trace output is a bit on the large side so I put it and the
 dmesg log at:
 
 http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
 
 http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

Thank you again!

Very strange part of the trace.  The only sign of CPU 2 and 3 are:

ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
 cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
 cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
 cpu -1 remaining 0 # 1
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
 WakeNot
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
 WakeNot
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
 WakeNotPoll
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
 WakeNotPoll
ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
 cpu -1 remaining 4 # 2

The pair of WakeNotPoll trace entries says that at that point, RCU believed
that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

On the test system I'm using, CPUs 2 and 3 really do not exist;
it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
earlier message, but perhaps you missed it in the flurry.

Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,

[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[0.00] NR_IRQS:16640 nr_irqs:456 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-3.

but later shows 2:

[0.233703] x86: Booting SMP configuration:
[0.236003]  node  #0, CPUs:  #1
[0.255528] x86: Booted up 1 node, 2 CPUs

In any event, the E8400 is a 2 core CPU with no hyperthreading.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 03:09:36PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 14:49, Paul E. McKenney wrote:
  On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
  
  [ . . . ]
  
  Well, if you are feeling aggressive, give the following patch a 
  spin.
  I am doing sanity tests on it in the meantime.
 
 Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
before
the problem occurs, then dump the trace buffer after the problem occurs.
   
   Sorry for being unresponsive here, but I know next to nothing about tracing
   or most things about the kernel, so I have some catching up to do.
   
   In the meantime some layman observations while I tried to find what 
   exactly
   triggers the problem.
   - Even in runlevel 1 I can reliably trigger the problem by starting 
   libvirtd
   - libvirtd seems to be very active in using all sorts of kernel facilities
 that are modules on fedora so it seems to cause many simultaneous 
   kworker 
 calls to modprobe
   - there are 8 kworker/u16 from 0 to 7
   - one of these kworkers always deadlocks, while there appear to be two
 kworker/u16:6 - the seventh
  
  Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
  
 6 vs 8 as in 6 rcuos where before they were always 8
   
   Just observations from someone who still doesn't know what the u16
   kworkers are..
  
  Could you please run the following diagnostic patch?  This will help
  me see if I have managed to miswire the rcuo kthreads.  It should
  print some information at task-hang time.
 
 So here is the output with today's Linux tip and the diagnostic patch.
 This is the case with just starting libvirtd in runlevel 1.

Thank you for testing this!

 Also a snapshot of the kworker/u16 threads
 
 6 ?S  0:00  \_ [kworker/u16:0]
   553 ?S  0:00  |   \_ [kworker/u16:0]
   554 ?D  0:00  |   \_ /sbin/modprobe -q -- bridge
78 ?S  0:00  \_ [kworker/u16:1]
92 ?S  0:00  \_ [kworker/u16:2]
93 ?S  0:00  \_ [kworker/u16:3]
94 ?S  0:00  \_ [kworker/u16:4]
95 ?S  0:00  \_ [kworker/u16:5]
96 ?D  0:00  \_ [kworker/u16:6]
   105 ?S  0:00  \_ [kworker/u16:7]
   108 ?S  0:00  \_ [kworker/u16:8]

You had six CPUs, IIRC, so the last two kworker/u16 kthreads are surplus
to requirements.  Not sure if they are causing any trouble, though.

 INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
   Not tainted 3.18.0-rc1+ #16
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 kworker/u16:6   D 8800ca9ecec0 1155296  2 0x
 Workqueue: netns cleanup_net
  880221fff9c8 0096 8800ca9ecec0 001d5f00
  880221d8 001d5f00 88022326 8800ca9ecec0
  82c44010 7fff 81ee3798 81ee3790
 Call Trace:
  [81866219] schedule+0x29/0x70
  [8186b43c] schedule_timeout+0x26c/0x410
  [81028bea] ? native_sched_clock+0x2a/0xa0
  [8110748c] ? mark_held_locks+0x7c/0xb0
  [8186c4c0] ? _raw_spin_unlock_irq+0x30/0x50
  [8110761d] ? trace_hardirqs_on_caller+0x15d/0x200
  [81867c4c] wait_for_completion+0x10c/0x150
  [810e4dc0] ? wake_up_state+0x20/0x20
  [81133627] _rcu_barrier+0x677/0xcd0
  [81133cd5] rcu_barrier+0x15/0x20
  [81720edf] netdev_run_todo+0x6f/0x310
  [81715aa5] ? rollback_registered_many+0x265/0x2e0
  [8172df4e] rtnl_unlock+0xe/0x10
  [81717906] default_device_exit_batch+0x156/0x180
  [810fd280] ? abort_exclusive_wait+0xb0/0xb0
  [8170f9b3] ops_exit_list.isra.1+0x53/0x60
  [81710560] cleanup_net+0x100/0x1f0
  [810cc988] process_one_work+0x218/0x850
  [810cc8ef] ? process_one_work+0x17f/0x850
  [810cd0a7] ? worker_thread+0xe7/0x4a0
  [810cd02b] worker_thread+0x6b/0x4a0
  [810ccfc0] ? process_one_work+0x850/0x850
  [810d337b] kthread+0x10b/0x130
  [81028c69] ? sched_clock+0x9/0x10
  [810d3270] ? kthread_create_on_node+0x250/0x250
  [8186d1fc] ret_from_fork+0x7c/0xb0
  [810d3270] ? kthread_create_on_node+0x250/0x250
 4 locks held by kworker/u16:6/96:
  #0:  ("%s""netns"){.+.+.+}, at: [810cc8ef]
  #process_one_work+0x17f/0x850
  #1:  (net_cleanup_work){+.+.+.}, at: [810cc8ef]
  #process_one_work+0x17f/0x850
  #2:  (net_mutex){+.+.+.}, at: [817104ec] cleanup_net+0x8c/0x1f0
  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [81133025]
  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Yanko Kaneti
On Fri-10/24/14-2014 14:49, Paul E. McKenney wrote:
 On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
 
 [ . . . ]
 
 Well, if you are feeling aggressive, give the following patch a spin.
 I am doing sanity tests on it in the meantime.

Doesn't seem to make a difference here
   
   OK, inspection isn't cutting it, so time for tracing.  Does the system
   respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
   the problem occurs, then dump the trace buffer after the problem occurs.
  
  Sorry for being unresponsive here, but I know next to nothing about tracing
  or most things about the kernel, so I have some catching up to do.
  
  In the meantime some layman observations while I tried to find what exactly
  triggers the problem.
  - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
  - libvirtd seems to be very active in using all sorts of kernel facilities
that are modules on fedora so it seems to cause many simultaneous kworker 
calls to modprobe
  - there are 8 kworker/u16 from 0 to 7
  - one of these kworkers always deadlocks, while there appear to be two
kworker/u16:6 - the seventh
 
 Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
 
6 vs 8 as in 6 rcuos where before they were always 8
  
  Just observations from someone who still doesn't know what the u16
  kworkers are..
 
 Could you please run the following diagnostic patch?  This will help
 me see if I have managed to miswire the rcuo kthreads.  It should
 print some information at task-hang time.

So here is the output with today's Linux tip and the diagnostic patch.
This is the case with just starting libvirtd in runlevel 1.
Also a snapshot of the kworker/u16 threads

6 ?S  0:00  \_ [kworker/u16:0]
  553 ?S  0:00  |   \_ [kworker/u16:0]
  554 ?D  0:00  |   \_ /sbin/modprobe -q -- bridge
   78 ?S  0:00  \_ [kworker/u16:1]
   92 ?S  0:00  \_ [kworker/u16:2]
   93 ?S  0:00  \_ [kworker/u16:3]
   94 ?S  0:00  \_ [kworker/u16:4]
   95 ?S  0:00  \_ [kworker/u16:5]
   96 ?D  0:00  \_ [kworker/u16:6]
  105 ?S  0:00  \_ [kworker/u16:7]
  108 ?S  0:00  \_ [kworker/u16:8]


INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #16
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kworker/u16:6   D 8800ca9ecec0 1155296  2 0x
Workqueue: netns cleanup_net
 880221fff9c8 0096 8800ca9ecec0 001d5f00
 880221d8 001d5f00 88022326 8800ca9ecec0
 82c44010 7fff 81ee3798 81ee3790
Call Trace:
 [81866219] schedule+0x29/0x70
 [8186b43c] schedule_timeout+0x26c/0x410
 [81028bea] ? native_sched_clock+0x2a/0xa0
 [8110748c] ? mark_held_locks+0x7c/0xb0
 [8186c4c0] ? _raw_spin_unlock_irq+0x30/0x50
 [8110761d] ? trace_hardirqs_on_caller+0x15d/0x200
 [81867c4c] wait_for_completion+0x10c/0x150
 [810e4dc0] ? wake_up_state+0x20/0x20
 [81133627] _rcu_barrier+0x677/0xcd0
 [81133cd5] rcu_barrier+0x15/0x20
 [81720edf] netdev_run_todo+0x6f/0x310
 [81715aa5] ? rollback_registered_many+0x265/0x2e0
 [8172df4e] rtnl_unlock+0xe/0x10
 [81717906] default_device_exit_batch+0x156/0x180
 [810fd280] ? abort_exclusive_wait+0xb0/0xb0
 [8170f9b3] ops_exit_list.isra.1+0x53/0x60
 [81710560] cleanup_net+0x100/0x1f0
 [810cc988] process_one_work+0x218/0x850
 [810cc8ef] ? process_one_work+0x17f/0x850
 [810cd0a7] ? worker_thread+0xe7/0x4a0
 [810cd02b] worker_thread+0x6b/0x4a0
 [810ccfc0] ? process_one_work+0x850/0x850
 [810d337b] kthread+0x10b/0x130
 [81028c69] ? sched_clock+0x9/0x10
 [810d3270] ? kthread_create_on_node+0x250/0x250
 [8186d1fc] ret_from_fork+0x7c/0xb0
 [810d3270] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/96:
 #0:  (%snetns){.+.+.+}, at: [810cc8ef]
 #process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: [810cc8ef]
 #process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [817104ec] cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [81133025]
 #_rcu_barrier+0x75/0xcd0
rcu_show_nocb_setup(): rcu_sched nocb state:
  0: 8802267ced40 l:8802267ced40 n:8802269ced40 .G.
  1: 8802269ced40 l:8802267ced40 n:  (null) ...
  2: 880226bced40 l:880226bced40 n:880226dced40 .G.
  3: 880226dced40 l:880226bced40 n:  (null) N..
  4: 880226fced40 l:880226fced40 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
  Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  
  On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
  [...]
   Hmmm...  It sure looks like we have some callbacks stuck here.  I 
   clearly
   need to take a hard look at the sleep/wakeup code.
   
   Thank you for running this!!!
  
  Could you please try the following patch?  If no joy, could you please
  add rcu:rcu_nocb_wake to the list of ftrace events?
  
 I tried the patch, it did not change the behavior.
  
 I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
  and ran it again (with this patch and the first patch from earlier
  today); the trace output is a bit on the large side so I put it and the
  dmesg log at:
  
  http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
  
  http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt
 
 Thank you again!
 
 Very strange part of the trace.  The only sign of CPU 2 and 3 are:
 
 ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
  cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
  cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
  cpu -1 remaining 0 # 1
 ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
  OnlineNoCB cpu 0 remaining 1 # 1
 ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
  WakeNot
 ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
  OnlineNoCB cpu 1 remaining 2 # 1
 ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
  WakeNot
 ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
  OnlineNoCB cpu 2 remaining 3 # 1
 ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
  WakeNotPoll
 ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
  OnlineNoCB cpu 3 remaining 4 # 1
 ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
  WakeNotPoll
 ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
  cpu -1 remaining 4 # 2
 
 The pair of WakeNotPoll trace entries says that at that point, RCU believed
 that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/
 
   On the test system I'm using, CPUs 2 and 3 really do not exist;
 it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
 earlier message, but perhaps you missed it in the flurry.

Or forgot it.  Either way, thank you for reminding me.

   Looking at the dmesg, the early boot messages seem to be
 confused as to how many CPUs there are, e.g.,
 
 [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
 [0.00] Hierarchical RCU implementation.
 [0.00]  RCU debugfs-based tracing is enabled.
 [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
 [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
 [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
 [0.00] NR_IRQS:16640 nr_irqs:456 0
 [0.00]  Offload RCU callbacks from all CPUs
 [0.00]  Offload RCU callbacks from CPUs: 0-3.
 
   but later shows 2:
 
 [0.233703] x86: Booting SMP configuration:
 [0.236003]  node  #0, CPUs:  #1
 [0.255528] x86: Booted up 1 node, 2 CPUs
 
   In any event, the E8400 is a 2 core CPU with no hyperthreading.

Well, this might explain some of the difficulties.  If RCU decides to wait
on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
was definitely expecting four CPUs.

So what happens if you boot with maxcpus=2?  (Or build with
CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
I might have some ideas for a real fix.
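
[Editorial aside, not part of the original mail: the CPU-count mismatch Paul
describes can be pulled mechanically out of the dmesg lines quoted above. A
minimal sketch follows; the sample text is hard-coded from the quotes, so no
live system is needed — on a real machine one would pipe `dmesg` in instead.]

```shell
# Extract the CPU counts from the dmesg lines quoted in this thread.
dmesg_sample='[    0.000000] smpboot: Allowing 4 CPUs, 2 hotplug CPUs
[    0.255528] x86: Booted up 1 node, 2 CPUs'

possible=$(printf '%s\n' "$dmesg_sample" | sed -n 's/.*Allowing \([0-9]*\) CPUs.*/\1/p')
hotplug=$(printf '%s\n' "$dmesg_sample" | sed -n 's/.*, \([0-9]*\) hotplug CPUs.*/\1/p')
booted=$(printf '%s\n' "$dmesg_sample" | sed -n 's/.*Booted up [0-9]* node[s]*, \([0-9]*\) CPUs.*/\1/p')

echo "possible=$possible booted=$booted hotplug=$hotplug"
```

On the affected Core 2 Duo this prints possible=4 booted=2 hotplug=2: the
"possible" count exceeds the booted count by exactly the hotplug allowance,
which matches rcu_barrier() waiting on callbacks for CPUs 2 and 3 that never
came online.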

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
> [...]
> >> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
> >> need to take a hard look at the sleep/wakeup code.
> >> 
> >> Thank you for running this!!!
> >
> >Could you please try the following patch?  If no joy, could you please
> >add rcu:rcu_nocb_wake to the list of ftrace events?
> 
>   I tried the patch, it did not change the behavior.
> 
>   I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
> and ran it again (with this patch and the first patch from earlier
> today); the trace output is a bit on the large side so I put it and the
> dmesg log at:
> 
> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
> 
> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

Thank you again!

Very strange part of the trace.  The only sign of CPU 2 and 3 are:

ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 cpu 
-1 remaining 0 # 1
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
WakeNot
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
WakeNot
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
WakeNotPoll
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
WakeNotPoll
ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 cpu 
-1 remaining 4 # 2

The pair of WakeNotPoll trace entries says that at that point, RCU believed
that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

More diagnostics in order...

Thanx, Paul

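
[Editorial aside, not part of the original mail: the WakeNotPoll entries Paul
points at can be filtered mechanically from the trace. A sketch over sample
lines hard-coded from the excerpt above, listing which CPUs RCU believed had
no rcuo kthread; on a real run one would read
/sys/kernel/debug/tracing/trace instead.]

```shell
# WakeNotPoll means the wakeup path found no kthread to wake for that CPU;
# plain WakeNot lines are the healthy case and must not match.
trace='ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 WakeNot
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 WakeNotPoll
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 WakeNotPoll'

missing=$(printf '%s\n' "$trace" | sed -n 's/.*rcu_sched \([0-9]*\) WakeNotPoll$/\1/p')
echo "CPUs with no rcuo kthread (per RCU): $missing"
```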


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
[...]
>> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
>> need to take a hard look at the sleep/wakeup code.
>> 
>> Thank you for running this!!!
>
>Could you please try the following patch?  If no joy, could you please
>add rcu:rcu_nocb_wake to the list of ftrace events?

I tried the patch, it did not change the behavior.

I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
and ran it again (with this patch and the first patch from earlier
today); the trace output is a bit on the large side so I put it and the
dmesg log at:

http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt

http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

-J


>   Thanx, Paul
>
>
>
>rcu: Kick rcuo kthreads after their CPU goes offline
>
>If a no-CBs CPU were to post an RCU callback with interrupts disabled
>after it entered the idle loop for the last time, there might be no
>deferred wakeup for the corresponding rcuo kthreads.  This commit
>therefore adds a set of calls to do_nocb_deferred_wakeup() after the
>CPU has gone completely offline.
>
>Signed-off-by: Paul E. McKenney 
>
>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>index 84b41b3c6ebd..f6880052b917 100644
>--- a/kernel/rcu/tree.c
>+++ b/kernel/rcu/tree.c
>@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
>   case CPU_DEAD_FROZEN:
>   case CPU_UP_CANCELED:
>   case CPU_UP_CANCELED_FROZEN:
>-  for_each_rcu_flavor(rsp)
>+  for_each_rcu_flavor(rsp) {
>   rcu_cleanup_dead_cpu(cpu, rsp);
>+  do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu));
>+  }
>   break;
>   default:
>   break;
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
> > Paul E. McKenney  wrote:
> > 
> > >On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> > >> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> > >> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > >> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > >
> > >[ . . . ]
> > >
> > >> > > > Well, if you are feeling aggressive, give the following patch a 
> > >> > > > spin.
> > >> > > > I am doing sanity tests on it in the meantime.
> > >> > > 
> > >> > > Doesn't seem to make a difference here
> > >> > 
> > >> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> > >> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
> > >> > before
> > >> > the problem occurs, then dump the trace buffer after the problem 
> > >> > occurs.
> > >> 
> > >> Sorry for being unresponsive here, but I know next to nothing about 
> > >> tracing
> > >> or most things about the kernel, so I have some catching up to do.
> > >> 
> > >> In the meantime some layman observations while I tried to find what 
> > >> exactly
> > >> triggers the problem.
> > >> - Even in runlevel 1 I can reliably trigger the problem by starting 
> > >> libvirtd
> > >> - libvirtd seems to be very active in using all sorts of kernel 
> > >> facilities
> > >>   that are modules on fedora so it seems to cause many simultaneous 
> > >> kworker 
> > >>   calls to modprobe
> > >> - there are 8 kworker/u16 from 0 to 7
> > >> - one of these kworkers always deadlocks, while there appear to be two
> > >>   kworker/u16:6 - the seventh
> > >
> > >Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> > >
> > >>   6 vs 8 as in 6 rcuos where before they were always 8
> > >> 
> > >> Just observations from someone who still doesn't know what the u16
> > >> kworkers are..
> > >
> > >Could you please run the following diagnostic patch?  This will help
> > >me see if I have managed to miswire the rcuo kthreads.  It should
> > >print some information at task-hang time.
> > 
> > Here's the output of the patch; I let it sit through two hang
> > cycles.
> > 
> > -J
> > 
> > 
> > [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 
> > seconds.
> > [  240.354878]   Not tainted 3.17.0-testola+ #4
> > [  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  240.367285] ovs-vswitchdD 88013fc94600 0   902901 
> > 0x0004
> > [  240.367290]  8800ab20f7b8 0002 8800b3304b00 
> > 8800ab20ffd8
> > [  240.367293]  00014600 00014600 8800b081 
> > 8800b3304b00
> > [  240.367296]  8800b3304b00 81c59850 81c59858 
> > 7fff
> > [  240.367300] Call Trace:
> > [  240.367307]  [] schedule+0x29/0x70
> > [  240.367310]  [] schedule_timeout+0x1dc/0x260
> > [  240.367313]  [] ? _cond_resched+0x29/0x40
> > [  240.367316]  [] ? wait_for_completion+0x28/0x160
> > [  240.367321]  [] ? queue_stop_cpus_work+0xc7/0xe0
> > [  240.367324]  [] wait_for_completion+0xa6/0x160
> > [  240.367328]  [] ? wake_up_state+0x20/0x20
> > [  240.367331]  [] _rcu_barrier+0x20c/0x480
> > [  240.367334]  [] rcu_barrier+0x15/0x20
> > [  240.367338]  [] netdev_run_todo+0x60/0x300
> > [  240.367341]  [] rtnl_unlock+0xe/0x10
> > [  240.367349]  [] internal_dev_destroy+0x55/0x80 
> > [openvswitch]
> > [  240.367354]  [] ovs_vport_del+0x32/0x40 [openvswitch]
> > [  240.367358]  [] ovs_dp_detach_port+0x30/0x40 
> > [openvswitch]
> > [  240.367363]  [] ovs_vport_cmd_del+0xc5/0x110 
> > [openvswitch]
> > [  240.367367]  [] genl_family_rcv_msg+0x1a5/0x3c0
> > [  240.367370]  [] ? genl_family_rcv_msg+0x3c0/0x3c0
> > [  240.367372]  [] genl_rcv_msg+0x91/0xd0
> > [  240.367376]  [] netlink_rcv_skb+0xc1/0xe0
> > [  240.367378]  [] genl_rcv+0x2c/0x40
> > [  240.367381]  [] netlink_unicast+0xf6/0x200
> > [  240.367383]  [] netlink_sendmsg+0x31d/0x780
> > [  240.367387]  [] ? netlink_rcv_wake+0x44/0x60
> > [  240.367391]  [] sock_sendmsg+0x93/0xd0
> > [  240.367395]  [] ? apparmor_capable+0x60/0x60
> > [  240.367399]  [] ? verify_iovec+0x47/0xd0
> > [  240.367402]  [] ___sys_sendmsg+0x399/0x3b0
> > [  240.367406]  [] ? kernfs_seq_stop_active+0x32/0x40
> > [  240.367410]  [] ? native_sched_clock+0x35/0x90
> > [  240.367413]  [] ? native_sched_clock+0x35/0x90
> > [  240.367416]  [] ? sched_clock+0x9/0x10
> > [  240.367420]  [] ? acct_account_cputime+0x1c/0x20
> > [  240.367424]  [] ? account_user_time+0x8b/0xa0
> > [  240.367428]  [] ? __fget_light+0x25/0x70
> > [  240.367431]  [] __sys_sendmsg+0x42/0x80
> > [  240.367433]  [] SyS_sendmsg+0x12/0x20
> > [  240.367436]  [] tracesys_phase2+0xd8/0xdd
> > [  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
> > [  240.372734]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 
> > .G.
> 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> >> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> >> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> >> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> >
> >[ . . . ]
> >
> >> > > > Well, if you are feeling aggressive, give the following patch a spin.
> >> > > > I am doing sanity tests on it in the meantime.
> >> > > 
> >> > > Doesn't seem to make a difference here
> >> > 
> >> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> >> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
> >> > before
> >> > the problem occurs, then dump the trace buffer after the problem occurs.
> >> 
> >> Sorry for being unresponsive here, but I know next to nothing about tracing
> >> or most things about the kernel, so I have some catching up to do.
> >> 
> >> In the meantime some layman observations while I tried to find what exactly
> >> triggers the problem.
> >> - Even in runlevel 1 I can reliably trigger the problem by starting 
> >> libvirtd
> >> - libvirtd seems to be very active in using all sorts of kernel facilities
> >>   that are modules on fedora so it seems to cause many simultaneous 
> >> kworker 
> >>   calls to modprobe
> >> - there are 8 kworker/u16 from 0 to 7
> >> - one of these kworkers always deadlocks, while there appear to be two
> >>   kworker/u16:6 - the seventh
> >
> >Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> >
> >>   6 vs 8 as in 6 rcuos where before they were always 8
> >> 
> >> Just observations from someone who still doesn't know what the u16
> >> kworkers are..
> >
> >Could you please run the following diagnostic patch?  This will help
> >me see if I have managed to miswire the rcuo kthreads.  It should
> >print some information at task-hang time.
> 
>   Here's the output of the patch; I let it sit through two hang
> cycles.
> 
>   -J
> 
> 
> [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
> [  240.354878]   Not tainted 3.17.0-testola+ #4
> [  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  240.367285] ovs-vswitchdD 88013fc94600 0   902901 
> 0x0004
> [  240.367290]  8800ab20f7b8 0002 8800b3304b00 
> 8800ab20ffd8
> [  240.367293]  00014600 00014600 8800b081 
> 8800b3304b00
> [  240.367296]  8800b3304b00 81c59850 81c59858 
> 7fff
> [  240.367300] Call Trace:
> [  240.367307]  [] schedule+0x29/0x70
> [  240.367310]  [] schedule_timeout+0x1dc/0x260
> [  240.367313]  [] ? _cond_resched+0x29/0x40
> [  240.367316]  [] ? wait_for_completion+0x28/0x160
> [  240.367321]  [] ? queue_stop_cpus_work+0xc7/0xe0
> [  240.367324]  [] wait_for_completion+0xa6/0x160
> [  240.367328]  [] ? wake_up_state+0x20/0x20
> [  240.367331]  [] _rcu_barrier+0x20c/0x480
> [  240.367334]  [] rcu_barrier+0x15/0x20
> [  240.367338]  [] netdev_run_todo+0x60/0x300
> [  240.367341]  [] rtnl_unlock+0xe/0x10
> [  240.367349]  [] internal_dev_destroy+0x55/0x80 
> [openvswitch]
> [  240.367354]  [] ovs_vport_del+0x32/0x40 [openvswitch]
> [  240.367358]  [] ovs_dp_detach_port+0x30/0x40 
> [openvswitch]
> [  240.367363]  [] ovs_vport_cmd_del+0xc5/0x110 
> [openvswitch]
> [  240.367367]  [] genl_family_rcv_msg+0x1a5/0x3c0
> [  240.367370]  [] ? genl_family_rcv_msg+0x3c0/0x3c0
> [  240.367372]  [] genl_rcv_msg+0x91/0xd0
> [  240.367376]  [] netlink_rcv_skb+0xc1/0xe0
> [  240.367378]  [] genl_rcv+0x2c/0x40
> [  240.367381]  [] netlink_unicast+0xf6/0x200
> [  240.367383]  [] netlink_sendmsg+0x31d/0x780
> [  240.367387]  [] ? netlink_rcv_wake+0x44/0x60
> [  240.367391]  [] sock_sendmsg+0x93/0xd0
> [  240.367395]  [] ? apparmor_capable+0x60/0x60
> [  240.367399]  [] ? verify_iovec+0x47/0xd0
> [  240.367402]  [] ___sys_sendmsg+0x399/0x3b0
> [  240.367406]  [] ? kernfs_seq_stop_active+0x32/0x40
> [  240.367410]  [] ? native_sched_clock+0x35/0x90
> [  240.367413]  [] ? native_sched_clock+0x35/0x90
> [  240.367416]  [] ? sched_clock+0x9/0x10
> [  240.367420]  [] ? acct_account_cputime+0x1c/0x20
> [  240.367424]  [] ? account_user_time+0x8b/0xa0
> [  240.367428]  [] ? __fget_light+0x25/0x70
> [  240.367431]  [] __sys_sendmsg+0x42/0x80
> [  240.367433]  [] SyS_sendmsg+0x12/0x20
> [  240.367436]  [] tracesys_phase2+0xd8/0xdd
> [  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
> [  240.372734]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 .G.
> [  240.379673]   1: 88013fc8e600 l:88013fc0e600 n:  (null) .G.
> [  240.386611]   2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N..
> [  240.393550]   3: 88013fd8e600 l:88013fd0e600 n:  (null) N..
> [  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
> [  240.405525]   0: 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
>> 
[...]
>>  I've got an ftrace capture from unmodified -net, it looks like
>> this:
>> 
>> ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
>> cpu -1 remaining 0 # 0
>> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
>> cpu -1 remaining 0 # 0
>> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
>> cpu -1 remaining 0 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 0 remaining 1 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 1 remaining 2 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 2 remaining 3 # 1
>> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 3 remaining 4 # 1
>
>OK, so it looks like your system has four CPUs, and rcu_barrier() placed
>callbacks on them all.

No, the system has only two CPUs.  It's an Intel Core 2 Duo
E8400, and /proc/cpuinfo agrees that there are only 2.  There is a
potentially relevant-sounding message early in dmesg that says:

[0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs

>> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
>> cpu -1 remaining 4 # 2
>
>The above removes the extra count used to avoid races between posting new
>callbacks and completion of previously posted callbacks.
>
>>  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB 
>> cpu -1 remaining 3 # 2
>>  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB 
>> cpu -1 remaining 2 # 2
>
>Two of the four callbacks fired, but the other two appear to be AWOL.
>And rcu_barrier() won't return until they all fire.
>
>>  I let it sit through several "hung task" cycles but that was all
>> there was for rcu:rcu_barrier.
>> 
>>  I should have ftrace with the patch as soon as the kernel is
>> done building, then I can try the below patch (I'll start it building
>> now).
>
>Sounds very good, looking forward to hearing of the results.

Going to bounce it for ftrace now, but the cpu count mismatch
seemed important enough to mention separately.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
>> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
>> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
>> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
>
>[ . . . ]
>
>> > > > Well, if you are feeling aggressive, give the following patch a spin.
>> > > > I am doing sanity tests on it in the meantime.
>> > > 
>> > > Doesn't seem to make a difference here
>> > 
>> > OK, inspection isn't cutting it, so time for tracing.  Does the system
>> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
>> > the problem occurs, then dump the trace buffer after the problem occurs.
>> 
>> Sorry for being unresponsive here, but I know next to nothing about tracing
>> or most things about the kernel, so I have some catching up to do.
>> 
>> In the meantime some layman observations while I tried to find what exactly
>> triggers the problem.
>> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
>> - libvirtd seems to be very active in using all sorts of kernel facilities
>>   that are modules on fedora so it seems to cause many simultaneous kworker 
>>   calls to modprobe
>> - there are 8 kworker/u16 from 0 to 7
>> - one of these kworkers always deadlocks, while there appear to be two
>>   kworker/u16:6 - the seventh
>
>Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
>
>>   6 vs 8 as in 6 rcuos where before they were always 8
>> 
>> Just observations from someone who still doesn't know what the u16
>> kworkers are..
>
>Could you please run the following diagnostic patch?  This will help
>me see if I have managed to miswire the rcuo kthreads.  It should
>print some information at task-hang time.

Here's the output of the patch; I let it sit through two hang
cycles.

-J


[  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  240.354878]   Not tainted 3.17.0-testola+ #4
[  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  240.367285] ovs-vswitchdD 88013fc94600 0   902901 0x0004
[  240.367290]  8800ab20f7b8 0002 8800b3304b00 
8800ab20ffd8
[  240.367293]  00014600 00014600 8800b081 
8800b3304b00
[  240.367296]  8800b3304b00 81c59850 81c59858 
7fff
[  240.367300] Call Trace:
[  240.367307]  [] schedule+0x29/0x70
[  240.367310]  [] schedule_timeout+0x1dc/0x260
[  240.367313]  [] ? _cond_resched+0x29/0x40
[  240.367316]  [] ? wait_for_completion+0x28/0x160
[  240.367321]  [] ? queue_stop_cpus_work+0xc7/0xe0
[  240.367324]  [] wait_for_completion+0xa6/0x160
[  240.367328]  [] ? wake_up_state+0x20/0x20
[  240.367331]  [] _rcu_barrier+0x20c/0x480
[  240.367334]  [] rcu_barrier+0x15/0x20
[  240.367338]  [] netdev_run_todo+0x60/0x300
[  240.367341]  [] rtnl_unlock+0xe/0x10
[  240.367349]  [] internal_dev_destroy+0x55/0x80 
[openvswitch]
[  240.367354]  [] ovs_vport_del+0x32/0x40 [openvswitch]
[  240.367358]  [] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  240.367363]  [] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  240.367367]  [] genl_family_rcv_msg+0x1a5/0x3c0
[  240.367370]  [] ? genl_family_rcv_msg+0x3c0/0x3c0
[  240.367372]  [] genl_rcv_msg+0x91/0xd0
[  240.367376]  [] netlink_rcv_skb+0xc1/0xe0
[  240.367378]  [] genl_rcv+0x2c/0x40
[  240.367381]  [] netlink_unicast+0xf6/0x200
[  240.367383]  [] netlink_sendmsg+0x31d/0x780
[  240.367387]  [] ? netlink_rcv_wake+0x44/0x60
[  240.367391]  [] sock_sendmsg+0x93/0xd0
[  240.367395]  [] ? apparmor_capable+0x60/0x60
[  240.367399]  [] ? verify_iovec+0x47/0xd0
[  240.367402]  [] ___sys_sendmsg+0x399/0x3b0
[  240.367406]  [] ? kernfs_seq_stop_active+0x32/0x40
[  240.367410]  [] ? native_sched_clock+0x35/0x90
[  240.367413]  [] ? native_sched_clock+0x35/0x90
[  240.367416]  [] ? sched_clock+0x9/0x10
[  240.367420]  [] ? acct_account_cputime+0x1c/0x20
[  240.367424]  [] ? account_user_time+0x8b/0xa0
[  240.367428]  [] ? __fget_light+0x25/0x70
[  240.367431]  [] __sys_sendmsg+0x42/0x80
[  240.367433]  [] SyS_sendmsg+0x12/0x20
[  240.367436]  [] tracesys_phase2+0xd8/0xdd
[  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
[  240.372734]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 .G.
[  240.379673]   1: 88013fc8e600 l:88013fc0e600 n:  (null) .G.
[  240.386611]   2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N..
[  240.393550]   3: 88013fd8e600 l:88013fd0e600 n:  (null) N..
[  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
[  240.405525]   0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ...
[  240.412463]   1: 88013fc8e3c0 l:88013fc0e3c0 n:  (null) ...
[  240.419401]   2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ...
[  240.426339]   3: 88013fd8e3c0 l:88013fd0e3c0 n:  (null) ...
[  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> >> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> >> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> >> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> >
> >[ . . . ]
> >
> >> > > > Well, if you are feeling aggressive, give the following patch a spin.
> >> > > > I am doing sanity tests on it in the meantime.
> >> > > 
> >> > > Doesn't seem to make a difference here
> >> > 
> >> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> >> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
> >> > before
> >> > the problem occurs, then dump the trace buffer after the problem occurs.
> >> 
> >> Sorry for being unresponsive here, but I know next to nothing about tracing
> >> or most things about the kernel, so I have some catching up to do.
> >> 
> >> In the meantime some layman observations while I tried to find what exactly
> >> triggers the problem.
> >> - Even in runlevel 1 I can reliably trigger the problem by starting 
> >> libvirtd
> >> - libvirtd seems to be very active in using all sorts of kernel facilities
> >>   that are modules on fedora so it seems to cause many simultaneous 
> >> kworker 
> >>   calls to modprobe
> >> - there are 8 kworker/u16 from 0 to 7
> >> - one of these kworkers always deadlocks, while there appear to be two
> >>   kworker/u16:6 - the seventh
> >
> >Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> >
> >>   6 vs 8 as in 6 rcuos where before they were always 8
> >> 
> >> Just observations from someone who still doesn't know what the u16
> >> kworkers are..
> >
> >Could you please run the following diagnostic patch?  This will help
> >me see if I have managed to miswire the rcuo kthreads.  It should
> >print some information at task-hang time.
> 
>   I can give this a spin after the ftrace (now that I've got
> CONFIG_RCU_TRACE turned on).
> 
>   I've got an ftrace capture from unmodified -net, it looks like
> this:
> 
> ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
> cpu -1 remaining 0 # 0
> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
> cpu -1 remaining 0 # 0
> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
> cpu -1 remaining 0 # 1
> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 0 remaining 1 # 1
> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 1 remaining 2 # 1
> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 2 remaining 3 # 1
> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 3 remaining 4 # 1

OK, so it looks like your system has four CPUs, and rcu_barrier() placed
callbacks on them all.

> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
> cpu -1 remaining 4 # 2

The above removes the extra count used to avoid races between posting new
callbacks and completion of previously posted callbacks.

>  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu 
> -1 remaining 3 # 2
>  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu 
> -1 remaining 2 # 2

Two of the four callbacks fired, but the other two appear to be AWOL.
And rcu_barrier() won't return until they all fire.

>   I let it sit through several "hung task" cycles but that was all
> there was for rcu:rcu_barrier.
> 
>   I should have ftrace with the patch as soon as the kernel is
> done building, then I can try the below patch (I'll start it building
> now).

Sounds very good, looking forward to hearing of the results.

Thanx, Paul

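
[Editorial aside, not part of the original mail: Paul's "two of the four
callbacks fired" reading can be checked by pulling the last "remaining" count
out of the rcu_barrier trace. A sketch over sample lines hard-coded from the
excerpt above.]

```shell
# The last rcu_barrier event carrying a "remaining N" field tells how many
# per-CPU callbacks rcu_barrier() is still waiting for.
trace='ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2
 rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu -1 remaining 3 # 2
 rcuos/1-18 [001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu -1 remaining 2 # 2'

outstanding=$(printf '%s\n' "$trace" | sed -n 's/.*remaining \([0-9]*\) #.*/\1/p' | tail -n 1)
echo "rcu_barrier still waiting on $outstanding callbacks"
```

With the full trace this yields 2: the callbacks posted for the two
never-booted CPUs, which is exactly why rcu_barrier() never returns.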


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
>> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
>> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
>> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
>
>[ . . . ]
>
>> > > > Well, if you are feeling aggressive, give the following patch a spin.
>> > > > I am doing sanity tests on it in the meantime.
>> > > 
>> > > Doesn't seem to make a difference here
>> > 
>> > OK, inspection isn't cutting it, so time for tracing.  Does the system
>> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
>> > the problem occurs, then dump the trace buffer after the problem occurs.
>> 
>> Sorry for being unresponsive here, but I know next to nothing about tracing
>> or most things about the kernel, so I have some catching up to do.
>> 
>> In the meantime some layman observations while I tried to find what exactly
>> triggers the problem.
>> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
>> - libvirtd seems to be very active in using all sorts of kernel facilities
>>   that are modules on fedora so it seems to cause many simultaneous kworker 
>>   calls to modprobe
>> - there are 8 kworker/u16 from 0 to 7
>> - one of these kworkers always deadlocks, while there appear to be two
>>   kworker/u16:6 - the seventh
>
>Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
>
>>   6 vs 8 as in 6 rcuos where before they were always 8
>> 
>> Just observations from someone who still doesn't know what the u16
>> kworkers are..
>
>Could you please run the following diagnostic patch?  This will help
>me see if I have managed to miswire the rcuo kthreads.  It should
>print some information at task-hang time.

I can give this a spin after the ftrace (now that I've got
CONFIG_RCU_TRACE turned on).

I've got an ftrace capture from unmodified -net, it looks like
this:

ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 cpu -1 remaining 0 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2
 rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu -1 remaining 3 # 2
 rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu -1 remaining 2 # 2

I let it sit through several "hung task" cycles but that was all
there was for rcu:rcu_barrier.

I should have ftrace with the patch as soon as the kernel is
done building, then I can try the below patch (I'll start it building
now).

-J




>   Thanx, Paul
>
>
>
>rcu: Dump no-CBs CPU state at task-hung time
>
>Strictly diagnostic commit for rcu_barrier() hang.  Not for inclusion.
>
>Signed-off-by: Paul E. McKenney 
>
>diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
>index 0e5366200154..34048140577b 100644
>--- a/include/linux/rcutiny.h
>+++ b/include/linux/rcutiny.h
>@@ -157,4 +157,8 @@ static inline bool rcu_is_watching(void)
> 
> #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
> 
>+static inline void rcu_show_nocb_setup(void)
>+{
>+}
>+
> #endif /* __LINUX_RCUTINY_H */
>diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>index 52953790dcca..0b813bdb971b 100644
>--- a/include/linux/rcutree.h
>+++ b/include/linux/rcutree.h
>@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
> 
> bool rcu_is_watching(void);
> 
>+void rcu_show_nocb_setup(void);
>+
> #endif /* __LINUX_RCUTREE_H */
>diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>index 06db12434d72..e6e4d0f6b063 100644
>--- a/kernel/hung_task.c
>+++ b/kernel/hung_task.c
>@@ -118,6 +118,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>   " disables this message.\n");
>   sched_show_task(t);
>   debug_show_held_locks(t);
>+  rcu_show_nocb_setup();
> 
>   touch_nmi_watchdog();
> 
>diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
>index 240fa9094f83..6b373e79ce0e 100644
>--- a/kernel/rcu/rcutorture.c
>+++ b/kernel/rcu/rcutorture.c
>@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void)
> {
>   int i;
> 
>+  rcu_show_nocb_setup();

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:

[ . . . ]

> > > > Well, if you are feeling aggressive, give the following patch a spin.
> > > > I am doing sanity tests on it in the meantime.
> > > 
> > > Doesn't seem to make a difference here
> > 
> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> > the problem occurs, then dump the trace buffer after the problem occurs.
> 
> Sorry for being unresponsive here, but I know next to nothing about tracing
> or most things about the kernel, so I have some catching up to do.
> 
> In the meantime some layman observations while I tried to find what exactly
> triggers the problem.
> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
> - libvirtd seems to be very active in using all sorts of kernel facilities
>   that are modules on fedora so it seems to cause many simultaneous kworker 
>   calls to modprobe
> - there are 8 kworker/u16 from 0 to 7
> - one of these kworkers always deadlocks, while there appear to be two
>   kworker/u16:6 - the seventh

Adding Tejun on CC in case this duplication of kworker/u16:6 is important.

>   6 vs 8 as in 6 rcuos where before they were always 8
> 
> Just observations from someone who still doesn't know what the u16
> kworkers are..

Could you please run the following diagnostic patch?  This will help
me see if I have managed to miswire the rcuo kthreads.  It should
print some information at task-hang time.

Thanx, Paul



rcu: Dump no-CBs CPU state at task-hung time

Strictly diagnostic commit for rcu_barrier() hang.  Not for inclusion.

Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 0e5366200154..34048140577b 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -157,4 +157,8 @@ static inline bool rcu_is_watching(void)
 
 #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
 
+static inline void rcu_show_nocb_setup(void)
+{
+}
+
 #endif /* __LINUX_RCUTINY_H */
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 52953790dcca..0b813bdb971b 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
 
 bool rcu_is_watching(void);
 
+void rcu_show_nocb_setup(void);
+
 #endif /* __LINUX_RCUTREE_H */
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 06db12434d72..e6e4d0f6b063 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -118,6 +118,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
" disables this message.\n");
sched_show_task(t);
debug_show_held_locks(t);
+   rcu_show_nocb_setup();
 
touch_nmi_watchdog();
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 240fa9094f83..6b373e79ce0e 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void)
 {
int i;
 
+   rcu_show_nocb_setup();
rcutorture_record_test_transition();
if (torture_cleanup_begin()) {
if (cur_ops->cb_barrier != NULL)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 927c17b081c7..285b3f6fb229 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
 
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
+void rcu_show_nocb_setup(void)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+   int cpu;
+   struct rcu_data *rdp;
+   struct rcu_state *rsp;
+
+   for_each_rcu_flavor(rsp) {
+   pr_alert("rcu_show_nocb_setup(): %s nocb state:\n", rsp->name);
+   for_each_possible_cpu(cpu) {
+   if (!rcu_is_nocb_cpu(cpu))
+   continue;
+   rdp = per_cpu_ptr(rsp->rda, cpu);
+   pr_alert("%3d: %p l:%p n:%p %c%c%c\n",
+cpu,
+rdp, rdp->nocb_leader, rdp->nocb_next_follower,
+".N"[!!rdp->nocb_head],
+".G"[!!rdp->nocb_gp_head],
+".F"[!!rdp->nocb_follower_head]);
+   }
+   }
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+}
+EXPORT_SYMBOL_GPL(rcu_show_nocb_setup);
+
 /*
  * An adaptive-ticks CPU can potentially execute in kernel mode for an
  * arbitrarily long period of time with the scheduling-clock tick turned


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> > > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> > > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > > > > > > 
> > > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
> > > > > > > > > > > wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > Ok, unless I've messed up something major, bisecting points to:
> > > > > > > > 
> > > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > > > > > > 
> > > > > > > > Makes any sense ?
> > > > > > > 
> > > > > > > Good question.  ;-)
> > > > > > > 
> > > > > > > Are any of your online CPUs missing rcuo kthreads?  There should 
> > > > > > > be
> > > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
> > > > > > > online CPU.
> > > > > > 
> > > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
> > > > > > reverted, the rcuos are 8
> > > > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
> > > > > > manages
> > > > > > to setup its bridge.
> > > > > > 
> > > > > > Just with linux-tip , the rcuos are 6 but the failure is as 
> > > > > > reliable as
> > > > > > before.
> > > > 
> > > > > Thank you, very interesting.  Which 6 of the rcuos are present?
> > > > 
> > > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
> > > > this   
> > > > Phenom II.
> > > 
> > > Ah, you get 8 without the patch because it creates them for potential
> > > CPUs as well as real ones.  OK, got it.
> > > 
> > > > > > Awaiting instructions: :)
> > > > > 
> > > > > Well, I thought I understood the problem until you found that only 6 
> > > > > of
> > > > > the expected 8 rcuos are present with linux-tip without the revert.  
> > > > > ;-)
> > > > > 
> > > > > I am putting together a patch for the part of the problem that I think
> > > > > I understand, of course, but it would help a lot to know which two of
> > > > > the rcuos are missing.  ;-)
> > > > 
> > > > Ready to test
> > > 
> > > Well, if you are feeling aggressive, give the following patch a spin.
> > > I am doing sanity tests on it in the meantime.
> > 
> > Doesn't seem to make a difference here
> 
> OK, inspection isn't cutting it, so time for tracing.  Does the system
> respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> the problem occurs, then dump the trace buffer after the problem occurs.

Sorry for being unresponsive here, but I know next to nothing about tracing
or most things about the kernel, so I have some catching up to do.

In the meantime some layman observations while I tried to find what exactly
triggers the problem.
- Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
- libvirtd seems to be very active in using all sorts of kernel facilities
  that are modules on fedora so it seems to cause many simultaneous kworker 
  calls to modprobe
- there are 8 kworker/u16 from 0 to 7
- one of these kworkers always deadlocks, while there appear to be two
  kworker/u16:6 - the seventh

  6 vs 8 as in 6 rcuos where before they were always 8

Just observations from someone who still doesn't know what the u16
kworkers are..

-- Yanko



 
>   Thanx, Paul
> 
> > > 
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 29fb23f33c18..927c17b081c7 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
> > >   rdp->nocb_leader = rdp_spawn;
> > >   if (rdp_last && rdp != rdp_spawn)
> > >   rdp_last->nocb_next_follower = rdp;
> > > - rdp_last = rdp;
> > > - rdp = rdp->nocb_next_follower;
> > > - rdp_last->nocb_next_follower = NULL;
> > > + if (rdp == rdp_spawn) {
> > > + rdp = rdp->nocb_next_follower;
> > > + } else {
> > > + rdp_last = rdp;
> > > + rdp = rdp->nocb_next_follower;
> > > + rdp_last->nocb_next_follower = NULL;
> > > + }
> > >   } while (rdp);
> > >   

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:57:53AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 11:49:48AM -0700, Jay Vosburgh wrote:
> > Paul E. McKenney  wrote:
> > 
> > >On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > >> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > >> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> > >> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> > >> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > >> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > >> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > >> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > >> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti 
> > >> > > > > > > > wrote:
> > >> > > > > > > > > 
> > >> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney 
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
> > >> > > > > > > > > > wrote:
> > >> > > > 
> > >> > > > [ . . . ]
> > >> > > > 
> > >> > > > > > > Ok, unless I've messed up something major, bisecting points to:
> > >> > > > > > > 
> > >> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > >> > > > > > > 
> > >> > > > > > > Makes any sense ?
> > >> > > > > > 
> > >> > > > > > Good question.  ;-)
> > >> > > > > > 
> > >> > > > > > Are any of your online CPUs missing rcuo kthreads?  There 
> > >> > > > > > should be
> > >> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
> > >> > > > > > online CPU.
> > >> > > > > 
> > >> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
> > >> > > > > reverted, the rcuos are 8
> > >> > > > > and the modprobe ppp_generic testcase reliably works, libvirt 
> > >> > > > > also manages
> > >> > > > > to setup its bridge.
> > >> > > > > 
> > >> > > > > Just with linux-tip , the rcuos are 6 but the failure is as 
> > >> > > > > reliable as
> > >> > > > > before.
> > >> > > 
> > >> > > > Thank you, very interesting.  Which 6 of the rcuos are present?
> > >> > > 
> > >> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
> > >> > > this   
> > >> > > Phenom II.
> > >> > 
> > >> > Ah, you get 8 without the patch because it creates them for potential
> > >> > CPUs as well as real ones.  OK, got it.
> > >> > 
> > >> > > > > Awaiting instructions: :)
> > >> > > > 
> > >> > > > Well, I thought I understood the problem until you found that only 
> > >> > > > 6 of
> > >> > > > the expected 8 rcuos are present with linux-tip without the 
> > >> > > > revert.  ;-)
> > >> > > > 
> > >> > > > I am putting together a patch for the part of the problem that I 
> > >> > > > think
> > >> > > > I understand, of course, but it would help a lot to know which two 
> > >> > > > of
> > >> > > > the rcuos are missing.  ;-)
> > >> > > 
> > >> > > Ready to test
> > >> > 
> > >> > Well, if you are feeling aggressive, give the following patch a spin.
> > >> > I am doing sanity tests on it in the meantime.
> > >> 
> > >> Doesn't seem to make a difference here
> > >
> > >OK, inspection isn't cutting it, so time for tracing.  Does the system
> > >respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> > >the problem occurs, then dump the trace buffer after the problem occurs.
> > 
> > My system is up and responsive when the problem occurs, so this
> > shouldn't be a problem.
> 
> Nice!  ;-)
> 
> > Do you want the ftrace with your patch below, or unmodified tip
> > of tree?
> 
> Let's please start with the patch.

And I should hasten to add that you need to set CONFIG_RCU_TRACE=y
for these tracepoints to be enabled.

Thanx, Paul

> > -J
> > 
> > 
> > >   Thanx, Paul
> > >
> > >> > 
> > >> > 
> > >> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > >> > index 29fb23f33c18..927c17b081c7 100644
> > >> > --- a/kernel/rcu/tree_plugin.h
> > >> > +++ b/kernel/rcu/tree_plugin.h
> > >> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
> > >> >rdp->nocb_leader = rdp_spawn;
> > >> >if (rdp_last && rdp != rdp_spawn)
> > >> >rdp_last->nocb_next_follower = rdp;
> > >> > -  rdp_last = rdp;
> > >> > -  rdp = rdp->nocb_next_follower;
> > >> > -  rdp_last->nocb_next_follower = NULL;
> > >> > +  if (rdp == rdp_spawn) {
> > >> > +  rdp = rdp->nocb_next_follower;
> > >> > +  } else {
> > >> > +  rdp_last = rdp;
> > >> > + 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:49:48AM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> >> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> >> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> >> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> >> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> >> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> >> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> >> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> >> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> >> > > > > > > > > 
> >> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> >> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
> >> > > > > > > > > > wrote:
> >> > > > 
> >> > > > [ . . . ]
> >> > > > 
> >> > > > > > > Ok, unless I've messed up something major, bisecting points to:
> >> > > > > > > 
> >> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> >> > > > > > > 
> >> > > > > > > Makes any sense ?
> >> > > > > > 
> >> > > > > > Good question.  ;-)
> >> > > > > > 
> >> > > > > > Are any of your online CPUs missing rcuo kthreads?  There should 
> >> > > > > > be
> >> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
> >> > > > > > online CPU.
> >> > > > > 
> >> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
> >> > > > > reverted, the rcuos are 8
> >> > > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
> >> > > > > manages
> >> > > > > to setup its bridge.
> >> > > > > 
> >> > > > > Just with linux-tip , the rcuos are 6 but the failure is as 
> >> > > > > reliable as
> >> > > > > before.
> >> > > 
> >> > > > Thank you, very interesting.  Which 6 of the rcuos are present?
> >> > > 
> >> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
> >> > > this   
> >> > > Phenom II.
> >> > 
> >> > Ah, you get 8 without the patch because it creates them for potential
> >> > CPUs as well as real ones.  OK, got it.
> >> > 
> >> > > > > Awaiting instructions: :)
> >> > > > 
> >> > > > Well, I thought I understood the problem until you found that only 6 
> >> > > > of
> >> > > > the expected 8 rcuos are present with linux-tip without the revert.  
> >> > > > ;-)
> >> > > > 
> >> > > > I am putting together a patch for the part of the problem that I 
> >> > > > think
> >> > > > I understand, of course, but it would help a lot to know which two of
> >> > > > the rcuos are missing.  ;-)
> >> > > 
> >> > > Ready to test
> >> > 
> >> > Well, if you are feeling aggressive, give the following patch a spin.
> >> > I am doing sanity tests on it in the meantime.
> >> 
> >> Doesn't seem to make a difference here
> >
> >OK, inspection isn't cutting it, so time for tracing.  Does the system
> >respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> >the problem occurs, then dump the trace buffer after the problem occurs.
> 
>   My system is up and responsive when the problem occurs, so this
> shouldn't be a problem.

Nice!  ;-)

>   Do you want the ftrace with your patch below, or unmodified tip
> of tree?

Let's please start with the patch.

Thanx, Paul

>   -J
> 
> 
> > Thanx, Paul
> >
> >> > 
> >> > 
> >> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >> > index 29fb23f33c18..927c17b081c7 100644
> >> > --- a/kernel/rcu/tree_plugin.h
> >> > +++ b/kernel/rcu/tree_plugin.h
> >> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
> >> >  rdp->nocb_leader = rdp_spawn;
> >> >  if (rdp_last && rdp != rdp_spawn)
> >> >  rdp_last->nocb_next_follower = rdp;
> >> > -rdp_last = rdp;
> >> > -rdp = rdp->nocb_next_follower;
> >> > -rdp_last->nocb_next_follower = NULL;
> >> > +if (rdp == rdp_spawn) {
> >> > +rdp = rdp->nocb_next_follower;
> >> > +} else {
> >> > +rdp_last = rdp;
> >> > +rdp = rdp->nocb_next_follower;
> >> > +rdp_last->nocb_next_follower = NULL;
> >> > +}
> >> >  } while (rdp);
> >> >  rdp_spawn->nocb_next_follower = rdp_old_leader;
> >> >  }
> >> > 
> 
> ---
>   -Jay Vosburgh, jay.vosbu...@canonical.com
> 


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
>> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
>> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
>> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
>> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
>> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
>> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
>> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
>> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
>> > > > > > > > > 
>> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
>> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
>> > > > > > > > > > wrote:
>> > > > 
>> > > > [ . . . ]
>> > > > 
>> > > > > > > Ok, unless I've messed up something major, bisecting points to:
>> > > > > > > 
>> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
>> > > > > > > 
>> > > > > > > Makes any sense ?
>> > > > > > 
>> > > > > > Good question.  ;-)
>> > > > > > 
>> > > > > > Are any of your online CPUs missing rcuo kthreads?  There should be
>> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
>> > > > > > online CPU.
>> > > > > 
>> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
>> > > > > reverted, the rcuos are 8
>> > > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
>> > > > > manages
>> > > > > to setup its bridge.
>> > > > > 
>> > > > > Just with linux-tip , the rcuos are 6 but the failure is as reliable 
>> > > > > as
>> > > > > before.
>> > > 
>> > > > Thank you, very interesting.  Which 6 of the rcuos are present?
>> > > 
>> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
>> > > this   
>> > > Phenom II.
>> > 
>> > Ah, you get 8 without the patch because it creates them for potential
>> > CPUs as well as real ones.  OK, got it.
>> > 
>> > > > > Awaiting instructions: :)
>> > > > 
>> > > > Well, I thought I understood the problem until you found that only 6 of
>> > > > the expected 8 rcuos are present with linux-tip without the revert.  
>> > > > ;-)
>> > > > 
>> > > > I am putting together a patch for the part of the problem that I think
>> > > > I understand, of course, but it would help a lot to know which two of
>> > > > the rcuos are missing.  ;-)
>> > > 
>> > > Ready to test
>> > 
>> > Well, if you are feeling aggressive, give the following patch a spin.
>> > I am doing sanity tests on it in the meantime.
>> 
>> Doesn't seem to make a difference here
>
>OK, inspection isn't cutting it, so time for tracing.  Does the system
>respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
>the problem occurs, then dump the trace buffer after the problem occurs.

My system is up and responsive when the problem occurs, so this
shouldn't be a problem.

Do you want the ftrace with your patch below, or unmodified tip
of tree?

-J


>   Thanx, Paul
>
>> > 
>> > 
>> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> > index 29fb23f33c18..927c17b081c7 100644
>> > --- a/kernel/rcu/tree_plugin.h
>> > +++ b/kernel/rcu/tree_plugin.h
>> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
>> >rdp->nocb_leader = rdp_spawn;
>> >if (rdp_last && rdp != rdp_spawn)
>> >rdp_last->nocb_next_follower = rdp;
>> > -  rdp_last = rdp;
>> > -  rdp = rdp->nocb_next_follower;
>> > -  rdp_last->nocb_next_follower = NULL;
>> > +  if (rdp == rdp_spawn) {
>> > +  rdp = rdp->nocb_next_follower;
>> > +  } else {
>> > +  rdp_last = rdp;
>> > +  rdp = rdp->nocb_next_follower;
>> > +  rdp_last->nocb_next_follower = NULL;
>> > +  }
>> >} while (rdp);
>> >rdp_spawn->nocb_next_follower = rdp_old_leader;
>> >}
>> > 

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:20:11AM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
> >> Paul E. McKenney  wrote:
> [...]
> >> >Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
> >> >__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
> >> >NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
> >> >If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
> >> >Make nocb leader kthreads process pending callbacks after spawning)
> >> >and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
> >> 
> >>Just a note to add that I am also reliably inducing what appears
> >> to be this issue on a current -net tree, when configuring openvswitch
> >> via script.  I am available to test patches or bisect tomorrow (Friday)
> >> US time if needed.
> >
> >Thank you, Jay!  Could you please check to see if reverting this commit
> >fixes things for you?
> >
> >35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> >
> >Reverting is not a long-term fix, as this commit is itself a bug fix,
> >but would be good to check to see if you are seeing the same thing that
> >Yanko is.  ;-)
> 
>   Just to confirm what Yanko found, reverting this commit makes
> the problem go away for me.

Thank you!

I take it that the patches that don't help Yanko also don't help you?

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > > > > > 
> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
> > > > > > > > > > wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > Ok, unless I've messed up something major, bisecting points to:
> > > > > > > 
> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > > > > > 
> > > > > > > Makes any sense ?
> > > > > > 
> > > > > > Good question.  ;-)
> > > > > > 
> > > > > > Are any of your online CPUs missing rcuo kthreads?  There should be
> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online 
> > > > > > CPU.
> > > > > 
> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
> > > > > reverted, the rcuos are 8
> > > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
> > > > > manages
> > > > > to setup its bridge.
> > > > > 
> > > > > Just with linux-tip , the rcuos are 6 but the failure is as reliable 
> > > > > as
> > > > > before.
> > > 
> > > > Thank you, very interesting.  Which 6 of the rcuos are present?
> > > 
> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this 
> > >   
> > > Phenom II.
> > 
> > Ah, you get 8 without the patch because it creates them for potential
> > CPUs as well as real ones.  OK, got it.
> > 
> > > > > Awaiting instructions: :)
> > > > 
> > > > Well, I thought I understood the problem until you found that only 6 of
> > > > the expected 8 rcuos are present with linux-tip without the revert.  ;-)
> > > > 
> > > > I am putting together a patch for the part of the problem that I think
> > > > I understand, of course, but it would help a lot to know which two of
> > > > the rcuos are missing.  ;-)
> > > 
> > > Ready to test
> > 
> > Well, if you are feeling aggressive, give the following patch a spin.
> > I am doing sanity tests on it in the meantime.
> 
> Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
the problem occurs, then dump the trace buffer after the problem occurs.
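For anyone reproducing this, one way to capture just the rcu:rcu_barrier events is via the tracefs event interface. This sketch assumes debugfs is mounted at /sys/kernel/debug (on newer kernels tracefs may instead be at /sys/kernel/tracing), CONFIG_RCU_TRACE=y, and root privileges:

```shell
# Enable only the rcu_barrier tracepoint
echo 1 > /sys/kernel/debug/tracing/events/rcu/rcu_barrier/enable

# ... reproduce the hang (e.g. start libvirtd or modprobe ppp_generic) ...

# Dump the trace buffer, then clear it
cat /sys/kernel/debug/tracing/trace
echo > /sys/kernel/debug/tracing/trace

# Disable the tracepoint again
echo 0 > /sys/kernel/debug/tracing/events/rcu/rcu_barrier/enable
```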

Thanx, Paul

> > 
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 29fb23f33c18..927c17b081c7 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
> > rdp->nocb_leader = rdp_spawn;
> > if (rdp_last && rdp != rdp_spawn)
> > rdp_last->nocb_next_follower = rdp;
> > -   rdp_last = rdp;
> > -   rdp = rdp->nocb_next_follower;
> > -   rdp_last->nocb_next_follower = NULL;
> > +   if (rdp == rdp_spawn) {
> > +   rdp = rdp->nocb_next_follower;
> > +   } else {
> > +   rdp_last = rdp;
> > +   rdp = rdp->nocb_next_follower;
> > +   rdp_last->nocb_next_follower = NULL;
> > +   }
> > } while (rdp);
> > rdp_spawn->nocb_next_follower = rdp_old_leader;
> > }
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
[...]
>> >Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
>> >__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
>> >NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
>> >If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
>> >Make nocb leader kthreads process pending callbacks after spawning)
>> >and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
>> 
>>  Just a note to add that I am also reliably inducing what appears
>> to be this issue on a current -net tree, when configuring openvswitch
>> via script.  I am available to test patches or bisect tomorrow (Friday)
>> US time if needed.
>
>Thank you, Jay!  Could you please check to see if reverting this commit
>fixes things for you?
>
>35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
>
>Reverting is not a long-term fix, as this commit is itself a bug fix,
>but would be good to check to see if you are seeing the same thing that
>Yanko is.  ;-)

Just to confirm what Yanko found, reverting this commit makes
the problem go away for me.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > > > > 
> > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > > Ok, unless I've messed up something major, bisecting points to:
> > > > > > 
> > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > > > > 
> > > > > > Makes any sense ?
> > > > > 
> > > > > Good question.  ;-)
> > > > > 
> > > > > Are any of your online CPUs missing rcuo kthreads?  There should be
> > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online 
> > > > > CPU.
> > > > 
> > > > It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, 
> > > > the rcuos are 8
> > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
> > > > manages
> > > > to set up its bridge.
> > > > 
> > > > Just with linux-tip, the rcuos are 6 but the failure is as reliable as
> > > > before.
> > 
> > > Thank you, very interesting.  Which 6 of the rcuos are present?
> > 
> > Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like this
> > Phenom II.
> 
> Ah, you get 8 without the patch because it creates them for potential
> CPUs as well as real ones.  OK, got it.
> 
> > > > Awaiting instructions: :)
> > > 
> > > Well, I thought I understood the problem until you found that only 6 of
> > > the expected 8 rcuos are present with linux-tip without the revert.  ;-)
> > > 
> > > I am putting together a patch for the part of the problem that I think
> > > I understand, of course, but it would help a lot to know which two of
> > > the rcuos are missing.  ;-)
> > 
> > Ready to test
> 
> Well, if you are feeling aggressive, give the following patch a spin.
> I am doing sanity tests on it in the meantime.

Doesn't seem to make a difference here

 
>   Thanx, Paul
> 
> 
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 29fb23f33c18..927c17b081c7 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
>   rdp->nocb_leader = rdp_spawn;
>   if (rdp_last && rdp != rdp_spawn)
>   rdp_last->nocb_next_follower = rdp;
> - rdp_last = rdp;
> - rdp = rdp->nocb_next_follower;
> - rdp_last->nocb_next_follower = NULL;
> + if (rdp == rdp_spawn) {
> + rdp = rdp->nocb_next_follower;
> + } else {
> + rdp_last = rdp;
> + rdp = rdp->nocb_next_follower;
> + rdp_last->nocb_next_follower = NULL;
> + }
>   } while (rdp);
>   rdp_spawn->nocb_next_follower = rdp_old_leader;
>   }
> 


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
> On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > > > 
> > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> > 
> > [ . . . ]
> > 
> > > > > Ok, unless I've messed up something major, bisecting points to:
> > > > > 
> > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > > > 
> > > > > Makes any sense ?
> > > > 
> > > > Good question.  ;-)
> > > > 
> > > > Are any of your online CPUs missing rcuo kthreads?  There should be
> > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
> > > 
> > > It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, 
> > > the rcuos are 8
> > > and the modprobe ppp_generic testcase reliably works, libvirt also manages
> > > to set up its bridge.
> > > 
> > > Just with linux-tip, the rcuos are 6 but the failure is as reliable as
> > > before.
> 
> > Thank you, very interesting.  Which 6 of the rcuos are present?
> 
> Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like this
> Phenom II.

Ah, you get 8 without the patch because it creates them for potential
CPUs as well as real ones.  OK, got it.

> > > Awaiting instructions: :)
> > 
> > Well, I thought I understood the problem until you found that only 6 of
> > the expected 8 rcuos are present with linux-tip without the revert.  ;-)
> > 
> > I am putting together a patch for the part of the problem that I think
> > I understand, of course, but it would help a lot to know which two of
> > the rcuos are missing.  ;-)
> 
> Ready to test

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.

Thanx, Paul



diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 29fb23f33c18..927c17b081c7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
 			rdp->nocb_leader = rdp_spawn;
 			if (rdp_last && rdp != rdp_spawn)
 				rdp_last->nocb_next_follower = rdp;
-			rdp_last = rdp;
-			rdp = rdp->nocb_next_follower;
-			rdp_last->nocb_next_follower = NULL;
+			if (rdp == rdp_spawn) {
+				rdp = rdp->nocb_next_follower;
+			} else {
+				rdp_last = rdp;
+				rdp = rdp->nocb_next_follower;
+				rdp_last->nocb_next_follower = NULL;
+			}
 		} while (rdp);
 		rdp_spawn->nocb_next_follower = rdp_old_leader;
 	}



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > > 
> > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> 
> [ . . . ]
> 
> > > > Ok, unless I've messed up something major, bisecting points to:
> > > > 
> > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > > 
> > > > Makes any sense ?
> > > 
> > > Good question.  ;-)
> > > 
> > > Are any of your online CPUs missing rcuo kthreads?  There should be
> > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
> > 
> > It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, the 
> > rcuos are 8
> > and the modprobe ppp_generic testcase reliably works, libvirt also manages
> > to set up its bridge.
> > 
> > Just with linux-tip, the rcuos are 6 but the failure is as reliable as
> > before.

 
> Thank you, very interesting.  Which 6 of the rcuos are present?

Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like this
Phenom II.

 
> > Awaiting instructions: :)
> 
> Well, I thought I understood the problem until you found that only 6 of
> the expected 8 rcuos are present with linux-tip without the revert.  ;-)
> 
> I am putting together a patch for the part of the problem that I think
> I understand, of course, but it would help a lot to know which two of
> the rcuos are missing.  ;-)
>

Ready to test

--Yanko


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
> On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > > 
> > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:

[ . . . ]

> > > Ok, unless I've messed up something major, bisecting points to:
> > > 
> > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
> > > 
> > > Makes any sense ?
> > 
> > Good question.  ;-)
> > 
> > Are any of your online CPUs missing rcuo kthreads?  There should be
> > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
> 
> It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, the 
> rcuos are 8
> and the modprobe ppp_generic testcase reliably works, libvirt also manages
> to set up its bridge.
> 
> Just with linux-tip, the rcuos are 6 but the failure is as reliable as
> before.

Thank you, very interesting.  Which 6 of the rcuos are present?
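One quick way to enumerate them is a ps pipeline like the following sketch (the rcuos/<cpu> naming follows the convention mentioned above; what it prints depends entirely on the running kernel's NOCB configuration):

```shell
# List rcuo callback-offload kthreads; print a note instead if none exist
# (e.g. CONFIG_RCU_NOCB_CPU=n, or no rcu_nocbs= boot parameter).
OUT=$(ps -e -o comm= | grep '^rcuo' | sort)
[ -n "$OUT" ] || OUT="no rcuo kthreads visible on this kernel"
echo "$OUT"
```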

> Awaiting instructions: :)

Well, I thought I understood the problem until you found that only 6 of
the expected 8 rcuos are present with linux-tip without the revert.  ;-)

I am putting together a patch for the part of the problem that I think
I understand, of course, but it would help a lot to know which two of
the rcuos are missing.  ;-)

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > > 
> > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> 
> [ . . . ]
> 
> > > > > > Indeed, c847f14217d5 it is.
> > > > > > 
> > > > > > Much to my embarrassment I just noticed that in addition to the
> > > > > > rcu merge, triggering the bug "requires" my specific Fedora 
> > > > > > rawhide network
> > > > > > setup. Booting in single mode and modprobe ppp_generic is fine. 
> > > > > > The bug
> > > > > > appears when starting with my regular Fedora network setup, which 
> > > > > > in my case
> > > > > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> > > > > > 
> > > > > > Hope that helps.
> > > > > > 
> > > > > > I am attaching the config.
> > > > > 
> > > > > It does help a lot, thank you!!!
> > > > > 
> > > > > The following patch is a bit of a shot in the dark, and assumes that
> > > > > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> > > > > idle
> > > > > code) introduced the problem.  Does this patch fix things up?
> > > > 
> > > > Unfortunately not, This is linus-tip + patch
> > > 
> > > OK.  Can't have everything, I guess.
> > > 
> > > > INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
> > > >   Not tainted 3.18.0-rc1+ #4
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> > > > message.
> > > > kworker/u16:6   D 8800ca84cec0 1116896  2 0x
> > > > Workqueue: netns cleanup_net
> > > >  8802218339e8 0096 8800ca84cec0 001d5f00
> > > >  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
> > > >  82c52040 7fff 81ee2658 81ee2650
> > > > Call Trace:
> > > >  [] schedule+0x29/0x70
> > > >  [] schedule_timeout+0x26c/0x410
> > > >  [] ? native_sched_clock+0x2a/0xa0
> > > >  [] ? mark_held_locks+0x7c/0xb0
> > > >  [] ? _raw_spin_unlock_irq+0x30/0x50
> > > >  [] ? trace_hardirqs_on_caller+0x15d/0x200
> > > >  [] wait_for_completion+0x10c/0x150
> > > >  [] ? wake_up_state+0x20/0x20
> > > >  [] _rcu_barrier+0x159/0x200
> > > >  [] rcu_barrier+0x15/0x20
> > > >  [] netdev_run_todo+0x6f/0x310
> > > >  [] ? rollback_registered_many+0x265/0x2e0
> > > >  [] rtnl_unlock+0xe/0x10
> > > >  [] default_device_exit_batch+0x156/0x180
> > > >  [] ? abort_exclusive_wait+0xb0/0xb0
> > > >  [] ops_exit_list.isra.1+0x53/0x60
> > > >  [] cleanup_net+0x100/0x1f0
> > > >  [] process_one_work+0x218/0x850
> > > >  [] ? process_one_work+0x17f/0x850
> > > >  [] ? worker_thread+0xe7/0x4a0
> > > >  [] worker_thread+0x6b/0x4a0
> > > >  [] ? process_one_work+0x850/0x850
> > > >  [] kthread+0x10b/0x130
> > > >  [] ? sched_clock+0x9/0x10
> > > >  [] ? kthread_create_on_node+0x250/0x250
> > > >  [] ret_from_fork+0x7c/0xb0
> > > >  [] ? kthread_create_on_node+0x250/0x250
> > > > 4 locks held by kworker/u16:6/96:
> > > >  #0:  ("%s""netns"){.+.+.+}, at: [] 
> > > > process_one_work+0x17f/0x850
> > > >  #1:  (net_cleanup_work){+.+.+.}, at: [] 
> > > > process_one_work+0x17f/0x850
> > > >  #2:  (net_mutex){+.+.+.}, at: [] 
> > > > cleanup_net+0x8c/0x1f0
> > > >  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [] 
> > > > _rcu_barrier+0x35/0x200
> > > > INFO: task modprobe:1045 blocked for more than 120 seconds.
> > > >   Not tainted 3.18.0-rc1+ #4
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> > > > message.
> > > > modprobeD 880218343480 12920  1045   1044 0x0080
> > > >  880218353bf8 0096 880218343480 001d5f00
> > > >  880218353fd8 001d5f00 81e1b580 880218343480
> > > >  880218343480 81f8f748 0246 880218343480
> > > > Call Trace:
> > > >  [] schedule_preempt_disabled+0x31/0x80
> > > >  [] mutex_lock_nested+0x183/0x440
> > > >  [] ? register_pernet_subsys+0x1f/0x50
> > > >  [] ? register_pernet_subsys+0x1f/0x50
> > > >  [] ? 0xa0673000
> > > >  [] register_pernet_subsys+0x1f/0x50
> > > >  [] br_init+0x48/0xd3 [bridge]
> > > >  [] do_one_initcall+0xd8/0x210
> > > >  [] load_module+0x20c2/0x2870
> > > >  [] ? store_uevent+0x70/0x70
> > > >  [] ? kernel_read+0x57/0x90
> > > >  [] SyS_finit_module+0xa6/0xe0
> > > >  [] system_call_fastpath+0x12/0x17
> > > > 1 lock held by modprobe/1045:
> > > >  #0:  (net_mutex){+.+.+.}, at: [] 
> > > > register_pernet_subsys+0x1f/0x50
> > > 
> > > Presumably the kworker/u16:6 completed, then modprobe hung?
> > > 
> > > If not, I have some very hard questions about why net_mutex can be
> > > held by two tasks concurrently, given that it does not appear to be a
> > > reader-writer lock...
> > > 
> > > Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
> On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > > 
> > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:

[ . . . ]

> > > > > Indeed, c847f14217d5 it is.
> > > > > 
> > > > > Much to my embarrassment I just noticed that in addition to the
> > > > > rcu merge, triggering the bug "requires" my specific Fedora 
> > > > > rawhide network
> > > > > setup. Booting in single mode and modprobe ppp_generic is fine. 
> > > > > The bug
> > > > > appears when starting with my regular Fedora network setup, which 
> > > > > in my case
> > > > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> > > > > 
> > > > > Hope that helps.
> > > > > 
> > > > > I am attaching the config.
> > > > 
> > > > It does help a lot, thank you!!!
> > > > 
> > > > The following patch is a bit of a shot in the dark, and assumes that
> > > > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> > > > idle
> > > > code) introduced the problem.  Does this patch fix things up?
> > > 
> > > Unfortunately not, This is linus-tip + patch
> > 
> > OK.  Can't have everything, I guess.
> > 
> > > INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
> > >   Not tainted 3.18.0-rc1+ #4
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > kworker/u16:6   D 8800ca84cec0 1116896  2 0x
> > > Workqueue: netns cleanup_net
> > >  8802218339e8 0096 8800ca84cec0 001d5f00
> > >  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
> > >  82c52040 7fff 81ee2658 81ee2650
> > > Call Trace:
> > >  [] schedule+0x29/0x70
> > >  [] schedule_timeout+0x26c/0x410
> > >  [] ? native_sched_clock+0x2a/0xa0
> > >  [] ? mark_held_locks+0x7c/0xb0
> > >  [] ? _raw_spin_unlock_irq+0x30/0x50
> > >  [] ? trace_hardirqs_on_caller+0x15d/0x200
> > >  [] wait_for_completion+0x10c/0x150
> > >  [] ? wake_up_state+0x20/0x20
> > >  [] _rcu_barrier+0x159/0x200
> > >  [] rcu_barrier+0x15/0x20
> > >  [] netdev_run_todo+0x6f/0x310
> > >  [] ? rollback_registered_many+0x265/0x2e0
> > >  [] rtnl_unlock+0xe/0x10
> > >  [] default_device_exit_batch+0x156/0x180
> > >  [] ? abort_exclusive_wait+0xb0/0xb0
> > >  [] ops_exit_list.isra.1+0x53/0x60
> > >  [] cleanup_net+0x100/0x1f0
> > >  [] process_one_work+0x218/0x850
> > >  [] ? process_one_work+0x17f/0x850
> > >  [] ? worker_thread+0xe7/0x4a0
> > >  [] worker_thread+0x6b/0x4a0
> > >  [] ? process_one_work+0x850/0x850
> > >  [] kthread+0x10b/0x130
> > >  [] ? sched_clock+0x9/0x10
> > >  [] ? kthread_create_on_node+0x250/0x250
> > >  [] ret_from_fork+0x7c/0xb0
> > >  [] ? kthread_create_on_node+0x250/0x250
> > > 4 locks held by kworker/u16:6/96:
> > >  #0:  ("%s""netns"){.+.+.+}, at: [] 
> > > process_one_work+0x17f/0x850
> > >  #1:  (net_cleanup_work){+.+.+.}, at: [] 
> > > process_one_work+0x17f/0x850
> > >  #2:  (net_mutex){+.+.+.}, at: [] cleanup_net+0x8c/0x1f0
> > >  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [] 
> > > _rcu_barrier+0x35/0x200
> > > INFO: task modprobe:1045 blocked for more than 120 seconds.
> > >   Not tainted 3.18.0-rc1+ #4
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > modprobeD 880218343480 12920  1045   1044 0x0080
> > >  880218353bf8 0096 880218343480 001d5f00
> > >  880218353fd8 001d5f00 81e1b580 880218343480
> > >  880218343480 81f8f748 0246 880218343480
> > > Call Trace:
> > >  [] schedule_preempt_disabled+0x31/0x80
> > >  [] mutex_lock_nested+0x183/0x440
> > >  [] ? register_pernet_subsys+0x1f/0x50
> > >  [] ? register_pernet_subsys+0x1f/0x50
> > >  [] ? 0xa0673000
> > >  [] register_pernet_subsys+0x1f/0x50
> > >  [] br_init+0x48/0xd3 [bridge]
> > >  [] do_one_initcall+0xd8/0x210
> > >  [] load_module+0x20c2/0x2870
> > >  [] ? store_uevent+0x70/0x70
> > >  [] ? kernel_read+0x57/0x90
> > >  [] SyS_finit_module+0xa6/0xe0
> > >  [] system_call_fastpath+0x12/0x17
> > > 1 lock held by modprobe/1045:
> > >  #0:  (net_mutex){+.+.+.}, at: [] 
> > > register_pernet_subsys+0x1f/0x50
> > 
> > Presumably the kworker/u16:6 completed, then modprobe hung?
> > 
> > If not, I have some very hard questions about why net_mutex can be
> > held by two tasks concurrently, given that it does not appear to be a
> > reader-writer lock...
> > 
> > Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
> > __call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
> > NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
> > If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
> > Make nocb leader kthreads process pending callbacks after 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney  wrote:
> 
> >On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> >> 
> >> On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> >> > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> >> > > On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> >> > > > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> >> > > > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> >> > > > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> >> > > > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
> >> > > > > > > wrote:
> >> > > > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> >> > > > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> >> > > > > > > > >  wrote:
> >> > > > > > > 
> >> > > > > > > [ . . . ]
> >> > > > > > > 
> >> > > > > > > > > > Don't get me wrong -- the fact that this kthread 
> >> > > > > > > > > > appears to
> >> > > > > > > > > > have
> >> > > > > > > > > > blocked within rcu_barrier() for 120 seconds means 
> >> > > > > > > > > > that
> >> > > > > > > > > > something is
> >> > > > > > > > > > most definitely wrong here.  I am surprised that 
> >> > > > > > > > > > there are no
> >> > > > > > > > > > RCU CPU
> >> > > > > > > > > > stall warnings, but perhaps the blockage is in the 
> >> > > > > > > > > > callback
> >> > > > > > > > > > execution
> >> > > > > > > > > > rather than grace-period completion.  Or something is
> >> > > > > > > > > > preventing this
> >> > > > > > > > > > kthread from starting up after the wake-up callback 
> >> > > > > > > > > > executes.
> >> > > > > > > > > > Or...
> >> > > > > > > > > > 
> >> > > > > > > > > > Is this thing reproducible?
> >> > > > > > > > > 
> >> > > > > > > > > I've added Yanko on CC, who reported the backtrace 
> >> > > > > > > > > above and can
> >> > > > > > > > > recreate it reliably.  Apparently reverting the RCU 
> >> > > > > > > > > merge commit
> >> > > > > > > > > (d6dd50e) and rebuilding the latest after that does 
> >> > > > > > > > > not show the
> >> > > > > > > > > issue.  I'll let Yanko explain more and answer any 
> >> > > > > > > > > questions you
> >> > > > > > > > > have.
> >> > > > > > > > 
> >> > > > > > > > - It is reproducible
> >> > > > > > > > - I've done another build here to double check and its 
> >> > > > > > > > definitely
> >> > > > > > > > the rcu merge
> >> > > > > > > >   that's causing it.
> >> > > > > > > > 
> >> > > > > > > > Don't think I'll be able to dig deeper, but I can do 
> >> > > > > > > > testing if
> >> > > > > > > > needed.
> >> > > > > > > 
> >> > > > > > > Please!  Does the following patch help?
> >> > > > > > 
> >> > > > > > Nope, doesn't seem to make a difference to the modprobe 
> >> > > > > > ppp_generic
> >> > > > > > test
> >> > > > > 
> >> > > > > Well, I was hoping.  I will take a closer look at the RCU 
> >> > > > > merge commit
> >> > > > > and see what suggests itself.  I am likely to ask you to 
> >> > > > > revert specific
> >> > > > > commits, if that works for you.
> >> > > > 
> >> > > > Well, rather than reverting commits, could you please try 
> >> > > > testing the
> >> > > > following commits?
> >> > > > 
> >> > > > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
> >> > > > callbacks after spawning)
> >> > > > 
> >> > > > 73a860cd58a1 (rcu: Replace flush_signals() with 
> >> > > > WARN_ON(signal_pending()))
> >> > > > 
> >> > > > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> >> > > > 
> >> > > > For whatever it is worth, I am guessing this one.
> >> > > 
> >> > > Indeed, c847f14217d5 it is.
> >> > > 
> >> > > Much to my embarrassment I just noticed that in addition to the
> >> > > rcu merge, triggering the bug "requires" my specific Fedora 
> >> > > rawhide network
> >> > > setup. Booting in single mode and modprobe ppp_generic is fine. 
> >> > > The bug
> >> > > appears when starting with my regular Fedora network setup, which 
> >> > > in my case
> >> > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> >> > > 
> >> > > Hope that helps.
> >> > > 
> >> > > I am attaching the config.
> >> > 
> >> > It does help a lot, thank you!!!
> >> > 
> >> > The following patch is a bit of a shot in the dark, and assumes that
> >> > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> >> > idle
> >> > code) introduced the problem.  Does this patch fix things up?
> >> 
> >> Unfortunately not, This is linus-tip + patch
> >
> >OK.  Can't have everything, I guess.
> >
> >> INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
> >>   Not tainted 3.18.0-rc1+ #4
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> kworker/u16:6   D 8800ca84cec0 1116896  2 0x
> >> Workqueue: netns cleanup_net
> >>  8802218339e8 0096 8800ca84cec0 001d5f00
> >>  880221833fd8 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> > 
> > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> > > > On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
> > > > > > > > wrote:
> > > > > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > [ . . . ]
> > > > > > > > 
> > > > > > > > > > > Don't get me wrong -- the fact that this kthread 
> > > > > > > > > > > appears to
> > > > > > > > > > > have
> > > > > > > > > > > blocked within rcu_barrier() for 120 seconds means 
> > > > > > > > > > > that
> > > > > > > > > > > something is
> > > > > > > > > > > most definitely wrong here.  I am surprised that 
> > > > > > > > > > > there are no
> > > > > > > > > > > RCU CPU
> > > > > > > > > > > stall warnings, but perhaps the blockage is in the 
> > > > > > > > > > > callback
> > > > > > > > > > > execution
> > > > > > > > > > > rather than grace-period completion.  Or something is
> > > > > > > > > > > preventing this
> > > > > > > > > > > kthread from starting up after the wake-up callback 
> > > > > > > > > > > executes.
> > > > > > > > > > > Or...
> > > > > > > > > > > 
> > > > > > > > > > > Is this thing reproducible?
> > > > > > > > > > 
> > > > > > > > > > I've added Yanko on CC, who reported the backtrace 
> > > > > > > > > > above and can
> > > > > > > > > > recreate it reliably.  Apparently reverting the RCU 
> > > > > > > > > > merge commit
> > > > > > > > > > (d6dd50e) and rebuilding the latest after that does 
> > > > > > > > > > not show the
> > > > > > > > > > issue.  I'll let Yanko explain more and answer any 
> > > > > > > > > > questions you
> > > > > > > > > > have.
> > > > > > > > > 
> > > > > > > > > - It is reproducible
> > > > > > > > > - I've done another build here to double check and its 
> > > > > > > > > definitely
> > > > > > > > > the rcu merge
> > > > > > > > >   that's causing it.
> > > > > > > > > 
> > > > > > > > > Don't think I'll be able to dig deeper, but I can do 
> > > > > > > > > testing if
> > > > > > > > > needed.
> > > > > > > > 
> > > > > > > > Please!  Does the following patch help?
> > > > > > > 
> > > > > > > Nope, doesn't seem to make a difference to the modprobe 
> > > > > > > ppp_generic
> > > > > > > test
> > > > > > 
> > > > > > Well, I was hoping.  I will take a closer look at the RCU 
> > > > > > merge commit
> > > > > > and see what suggests itself.  I am likely to ask you to 
> > > > > > revert specific
> > > > > > commits, if that works for you.
> > > > > 
> > > > > Well, rather than reverting commits, could you please try 
> > > > > testing the
> > > > > following commits?
> > > > > 
> > > > > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
> > > > > callbacks after spawning)
> > > > > 
> > > > > 73a860cd58a1 (rcu: Replace flush_signals() with 
> > > > > WARN_ON(signal_pending()))
> > > > > 
> > > > > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> > > > > 
> > > > > For whatever it is worth, I am guessing this one.
> > > > 
> > > > Indeed, c847f14217d5 it is.
> > > > 
> > > > Much to my embarrassment I just noticed that in addition to the
> > > > rcu merge, triggering the bug "requires" my specific Fedora 
> > > > rawhide network
> > > > setup. Booting in single mode and modprobe ppp_generic is fine. 
> > > > The bug
> > > > appears when starting with my regular Fedora network setup, which 
> > > > in my case
> > > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> > > > 
> > > > Hope that helps.
> > > > 
> > > > I am attaching the config.
> > > 
> > > It does help a lot, thank you!!!
> > > 
> > > The following patch is a bit of a shot in the dark, and assumes that
> > > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> > > idle
> > > code) introduced the problem.  Does this patch fix things up?
> > 
> > Unfortunately not, This is linus-tip + patch
> 
> OK.  Can't have everything, I guess.
> 
> > INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
> >   Not tainted 3.18.0-rc1+ #4
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > kworker/u16:6   D 8800ca84cec0 1116896  2 0x
> > Workqueue: netns cleanup_net
> >  8802218339e8 0096 8800ca84cec0 001d5f00
> >  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
> >  82c52040 7fff 81ee2658 81ee2650
> > Call Trace:
> >  [] 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
  
  On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
   On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
wrote:
 On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
  On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:

[ . . . ]

   Don't get me wrong -- the fact that this kthread 
   appears to
   have
   blocked within rcu_barrier() for 120 seconds means 
   that
   something is
   most definitely wrong here.  I am surprised that 
   there are no
   RCU CPU
   stall warnings, but perhaps the blockage is in the 
   callback
   execution
   rather than grace-period completion.  Or something is
   preventing this
   kthread from starting up after the wake-up callback 
   executes.
   Or...
   
   Is this thing reproducible?
  
  I've added Yanko on CC, who reported the backtrace 
  above and can
  recreate it reliably.  Apparently reverting the RCU 
  merge commit
  (d6dd50e) and rebuilding the latest after that does 
  not show the
  issue.  I'll let Yanko explain more and answer any 
  questions you
  have.
 
 - It is reproducible
 - I've done another build here to double check and it's
 definitely
 the rcu merge
   that's causing it.
 
 Don't think I'll be able to dig deeper, but I can do 
 testing if
 needed.

Please!  Does the following patch help?
   
   Nope, doesn't seem to make a difference to the modprobe 
   ppp_generic
   test
  
  Well, I was hoping.  I will take a closer look at the RCU 
  merge commit
  and see what suggests itself.  I am likely to ask you to 
  revert specific
  commits, if that works for you.
 
 Well, rather than reverting commits, could you please try 
 testing the
 following commits?
 
 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
 callbacks after spawning)
 
 73a860cd58a1 (rcu: Replace flush_signals() with 
 WARN_ON(signal_pending()))
 
 c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
 
 For whatever it is worth, I am guessing this one.

Indeed, c847f14217d5 it is.

Much to my embarrassment I just noticed that in addition to the
rcu merge, triggering the bug requires my specific Fedora 
rawhide network
setup. Booting in single mode and modprobe ppp_generic is fine. 
The bug
appears when starting with my regular fedora network setup, which 
in my case
includes 3 ethernet adapters and a libvirt bridge+nat setup.

Hope that helps.

I am attaching the config.
   
   It does help a lot, thank you!!!
   
   The following patch is a bit of a shot in the dark, and assumes that
   commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
   idle
   code) introduced the problem.  Does this patch fix things up?
  
  Unfortunately not, This is linus-tip + patch
 
 OK.  Can't have everything, I guess.
 
  INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
Not tainted 3.18.0-rc1+ #4
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u16:6   D 8800ca84cec0 1116896  2 0x
  Workqueue: netns cleanup_net
   8802218339e8 0096 8800ca84cec0 001d5f00
   880221833fd8 001d5f00 880223264ec0 8800ca84cec0
   82c52040 7fff 81ee2658 81ee2650
  Call Trace:
   [8185b8e9] schedule+0x29/0x70
   [81860b0c] schedule_timeout+0x26c/0x410
   [81028bea] ? native_sched_clock+0x2a/0xa0
   [8110759c] ? mark_held_locks+0x7c/0xb0
   [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
   [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
   [8185d31c] wait_for_completion+0x10c/0x150
   [810e4ed0] ? wake_up_state+0x20/0x20
   [8112a219] _rcu_barrier+0x159/0x200
   [8112a315] rcu_barrier+0x15/0x20
   [8171657f] netdev_run_todo+0x6f/0x310
   [8170b145] ? rollback_registered_many+0x265/0x2e0
   [817235ee] rtnl_unlock+0xe/0x10
   [8170cfa6] default_device_exit_batch+0x156/0x180
   [810fd390] ? 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
  
  On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
   On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
wrote:
 On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
  On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:

[ . . . ]

   Don't get me wrong -- the fact that this kthread 
   appears to
   have
   blocked within rcu_barrier() for 120 seconds means 
   that
   something is
   most definitely wrong here.  I am surprised that 
   there are no
   RCU CPU
   stall warnings, but perhaps the blockage is in the 
   callback
   execution
   rather than grace-period completion.  Or something is
   preventing this
   kthread from starting up after the wake-up callback 
   executes.
   Or...
   
   Is this thing reproducible?
  
  I've added Yanko on CC, who reported the backtrace 
  above and can
  recreate it reliably.  Apparently reverting the RCU 
  merge commit
  (d6dd50e) and rebuilding the latest after that does 
  not show the
  issue.  I'll let Yanko explain more and answer any 
  questions you
  have.
 
 - It is reproducible
 - I've done another build here to double check and it's
 definitely
 the rcu merge
   that's causing it.
 
 Don't think I'll be able to dig deeper, but I can do 
 testing if
 needed.

Please!  Does the following patch help?
   
   Nope, doesn't seem to make a difference to the modprobe 
   ppp_generic
   test
  
  Well, I was hoping.  I will take a closer look at the RCU 
  merge commit
  and see what suggests itself.  I am likely to ask you to 
  revert specific
  commits, if that works for you.
 
 Well, rather than reverting commits, could you please try 
 testing the
 following commits?
 
 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
 callbacks after spawning)
 
 73a860cd58a1 (rcu: Replace flush_signals() with 
 WARN_ON(signal_pending()))
 
 c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
 
 For whatever it is worth, I am guessing this one.

Indeed, c847f14217d5 it is.

Much to my embarrassment I just noticed that in addition to the
rcu merge, triggering the bug requires my specific Fedora 
rawhide network
setup. Booting in single mode and modprobe ppp_generic is fine. 
The bug
appears when starting with my regular fedora network setup, which 
in my case
includes 3 ethernet adapters and a libvirt bridge+nat setup.

Hope that helps.

I am attaching the config.
   
   It does help a lot, thank you!!!
   
   The following patch is a bit of a shot in the dark, and assumes that
   commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
   idle
   code) introduced the problem.  Does this patch fix things up?
  
  Unfortunately not, This is linus-tip + patch
 
 OK.  Can't have everything, I guess.
 
  INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
Not tainted 3.18.0-rc1+ #4
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u16:6   D 8800ca84cec0 1116896  2 0x
  Workqueue: netns cleanup_net
   8802218339e8 0096 8800ca84cec0 001d5f00
   880221833fd8 001d5f00 880223264ec0 8800ca84cec0
   82c52040 7fff 81ee2658 81ee2650
  Call Trace:
   [8185b8e9] schedule+0x29/0x70
   [81860b0c] schedule_timeout+0x26c/0x410
   [81028bea] ? native_sched_clock+0x2a/0xa0
   [8110759c] ? mark_held_locks+0x7c/0xb0
   [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
   [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
   [8185d31c] wait_for_completion+0x10c/0x150
   [810e4ed0] ? wake_up_state+0x20/0x20
   [8112a219] _rcu_barrier+0x159/0x200
   [8112a315] rcu_barrier+0x15/0x20
   [8171657f] netdev_run_todo+0x6f/0x310
   [8170b145] ? rollback_registered_many+0x265/0x2e0
   [817235ee] rtnl_unlock+0xe/0x10
   [8170cfa6] 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
 On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
   
   On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:

[ . . . ]

 Indeed, c847f14217d5 it is.
 
 Much to my embarrassment I just noticed that in addition to the
 rcu merge, triggering the bug requires my specific Fedora 
 rawhide network
 setup. Booting in single mode and modprobe ppp_generic is fine. 
 The bug
 appears when starting with my regular fedora network setup, which 
 in my case
 includes 3 ethernet adapters and a libvirt bridge+nat setup.
 
 Hope that helps.
 
 I am attaching the config.

It does help a lot, thank you!!!

The following patch is a bit of a shot in the dark, and assumes that
commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
idle
code) introduced the problem.  Does this patch fix things up?
   
   Unfortunately not, This is linus-tip + patch
  
  OK.  Can't have everything, I guess.
  
   INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
 Not tainted 3.18.0-rc1+ #4
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   kworker/u16:6   D 8800ca84cec0 1116896  2 0x
   Workqueue: netns cleanup_net
8802218339e8 0096 8800ca84cec0 001d5f00
880221833fd8 001d5f00 880223264ec0 8800ca84cec0
82c52040 7fff 81ee2658 81ee2650
   Call Trace:
[8185b8e9] schedule+0x29/0x70
[81860b0c] schedule_timeout+0x26c/0x410
[81028bea] ? native_sched_clock+0x2a/0xa0
[8110759c] ? mark_held_locks+0x7c/0xb0
[81861b90] ? _raw_spin_unlock_irq+0x30/0x50
[8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
[8185d31c] wait_for_completion+0x10c/0x150
[810e4ed0] ? wake_up_state+0x20/0x20
[8112a219] _rcu_barrier+0x159/0x200
[8112a315] rcu_barrier+0x15/0x20
[8171657f] netdev_run_todo+0x6f/0x310
[8170b145] ? rollback_registered_many+0x265/0x2e0
[817235ee] rtnl_unlock+0xe/0x10
[8170cfa6] default_device_exit_batch+0x156/0x180
[810fd390] ? abort_exclusive_wait+0xb0/0xb0
[81705053] ops_exit_list.isra.1+0x53/0x60
[81705c00] cleanup_net+0x100/0x1f0
[810cca98] process_one_work+0x218/0x850
[810cc9ff] ? process_one_work+0x17f/0x850
[810cd1b7] ? worker_thread+0xe7/0x4a0
[810cd13b] worker_thread+0x6b/0x4a0
[810cd0d0] ? process_one_work+0x850/0x850
[810d348b] kthread+0x10b/0x130
[81028c69] ? sched_clock+0x9/0x10
[810d3380] ? kthread_create_on_node+0x250/0x250
[818628bc] ret_from_fork+0x7c/0xb0
[810d3380] ? kthread_create_on_node+0x250/0x250
   4 locks held by kworker/u16:6/96:
#0:  (%snetns){.+.+.+}, at: [810cc9ff] 
   process_one_work+0x17f/0x850
#1:  (net_cleanup_work){+.+.+.}, at: [810cc9ff] 
   process_one_work+0x17f/0x850
#2:  (net_mutex){+.+.+.}, at: [81705b8c] cleanup_net+0x8c/0x1f0
#3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [8112a0f5] 
   _rcu_barrier+0x35/0x200
   INFO: task modprobe:1045 blocked for more than 120 seconds.
 Not tainted 3.18.0-rc1+ #4
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   modprobeD 880218343480 12920  1045   1044 0x0080
880218353bf8 0096 880218343480 001d5f00
880218353fd8 001d5f00 81e1b580 880218343480
880218343480 81f8f748 0246 880218343480
   Call Trace:
[8185be91] schedule_preempt_disabled+0x31/0x80
[8185d6e3] mutex_lock_nested+0x183/0x440
[81705a1f] ? register_pernet_subsys+0x1f/0x50
[81705a1f] ? register_pernet_subsys+0x1f/0x50
[a0673000] ? 0xa0673000
[81705a1f] register_pernet_subsys+0x1f/0x50
[a0673048] br_init+0x48/0xd3 [bridge]
[81002148] do_one_initcall+0xd8/0x210
[81153052] load_module+0x20c2/0x2870
[8114e030] ? store_uevent+0x70/0x70
[81278717] ? kernel_read+0x57/0x90
[811539e6] SyS_finit_module+0xa6/0xe0
[81862969] system_call_fastpath+0x12/0x17
   1 lock held by modprobe/1045:
#0:  (net_mutex){+.+.+.}, at: [81705a1f] 
   register_pernet_subsys+0x1f/0x50
  
  Presumably the kworker/u16:6 completed, then modprobe hung?
  
  If not, I have some very hard questions about why net_mutex can be
  held by two tasks concurrently, given that it does not appear to be a
  reader-writer lock...
  
  Either 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
  On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:

On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
 
 [ . . . ]
 
  Indeed, c847f14217d5 it is.
  
  Much to my embarrassment I just noticed that in addition to the
  rcu merge, triggering the bug requires my specific Fedora 
  rawhide network
  setup. Booting in single mode and modprobe ppp_generic is fine. 
  The bug
  appears when starting with my regular fedora network setup, which 
  in my case
  includes 3 ethernet adapters and a libvirt bridge+nat setup.
  
  Hope that helps.
  
  I am attaching the config.
 
 It does help a lot, thank you!!!
 
 The following patch is a bit of a shot in the dark, and assumes that
 commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
 idle
 code) introduced the problem.  Does this patch fix things up?

Unfortunately not, This is linus-tip + patch
   
   OK.  Can't have everything, I guess.
   
INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:6   D 8800ca84cec0 1116896  2 0x
Workqueue: netns cleanup_net
 8802218339e8 0096 8800ca84cec0 001d5f00
 880221833fd8 001d5f00 880223264ec0 8800ca84cec0
 82c52040 7fff 81ee2658 81ee2650
Call Trace:
 [8185b8e9] schedule+0x29/0x70
 [81860b0c] schedule_timeout+0x26c/0x410
 [81028bea] ? native_sched_clock+0x2a/0xa0
 [8110759c] ? mark_held_locks+0x7c/0xb0
 [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
 [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
 [8185d31c] wait_for_completion+0x10c/0x150
 [810e4ed0] ? wake_up_state+0x20/0x20
 [8112a219] _rcu_barrier+0x159/0x200
 [8112a315] rcu_barrier+0x15/0x20
 [8171657f] netdev_run_todo+0x6f/0x310
 [8170b145] ? rollback_registered_many+0x265/0x2e0
 [817235ee] rtnl_unlock+0xe/0x10
 [8170cfa6] default_device_exit_batch+0x156/0x180
 [810fd390] ? abort_exclusive_wait+0xb0/0xb0
 [81705053] ops_exit_list.isra.1+0x53/0x60
 [81705c00] cleanup_net+0x100/0x1f0
 [810cca98] process_one_work+0x218/0x850
 [810cc9ff] ? process_one_work+0x17f/0x850
 [810cd1b7] ? worker_thread+0xe7/0x4a0
 [810cd13b] worker_thread+0x6b/0x4a0
 [810cd0d0] ? process_one_work+0x850/0x850
 [810d348b] kthread+0x10b/0x130
 [81028c69] ? sched_clock+0x9/0x10
 [810d3380] ? kthread_create_on_node+0x250/0x250
 [818628bc] ret_from_fork+0x7c/0xb0
 [810d3380] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/96:
 #0:  (%snetns){.+.+.+}, at: [810cc9ff] 
process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: [810cc9ff] 
process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [81705b8c] 
cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [8112a0f5] 
_rcu_barrier+0x35/0x200
INFO: task modprobe:1045 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobeD 880218343480 12920  1045   1044 0x0080
 880218353bf8 0096 880218343480 001d5f00
 880218353fd8 001d5f00 81e1b580 880218343480
 880218343480 81f8f748 0246 880218343480
Call Trace:
 [8185be91] schedule_preempt_disabled+0x31/0x80
 [8185d6e3] mutex_lock_nested+0x183/0x440
 [81705a1f] ? register_pernet_subsys+0x1f/0x50
 [81705a1f] ? register_pernet_subsys+0x1f/0x50
 [a0673000] ? 0xa0673000
 [81705a1f] register_pernet_subsys+0x1f/0x50
 [a0673048] br_init+0x48/0xd3 [bridge]
 [81002148] do_one_initcall+0xd8/0x210
 [81153052] load_module+0x20c2/0x2870
 [8114e030] ? store_uevent+0x70/0x70
 [81278717] ? kernel_read+0x57/0x90
 [811539e6] SyS_finit_module+0xa6/0xe0
 [81862969] system_call_fastpath+0x12/0x17
1 lock held by modprobe/1045:
 #0:  (net_mutex){+.+.+.}, at: [81705a1f] 
register_pernet_subsys+0x1f/0x50
   
   Presumably the kworker/u16:6 completed, then modprobe hung?
   
   If 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:

[ . . . ]

   Ok, unless I've messed up something major, bisecting points to:
   
   35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
   
   Makes any sense ?
  
  Good question.  ;-)
  
  Are any of your online CPUs missing rcuo kthreads?  There should be
  kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
 
 It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, the
 rcuos are 8
 and the modprobe ppp_generic testcase reliably works, libvirt also manages
 to setup its bridge.
 
 Just with linux-tip , the rcuos are 6 but the failure is as reliable as
 before.

Thank you, very interesting.  Which 6 of the rcuos are present?

 Awaiting instructions: :)

Well, I thought I understood the problem until you found that only 6 of
the expected 8 rcuos are present with linux-tip without the revert.  ;-)

I am putting together a patch for the part of the problem that I think
I understand, of course, but it would help a lot to know which two of
the rcuos are missing.  ;-)

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
  
  On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
 
 [ . . . ]
 
Ok, unless I've messed up something major, bisecting points to:

35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs

Makes any sense ?
   
   Good question.  ;-)
   
   Are any of your online CPUs missing rcuo kthreads?  There should be
   kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
  
  It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, the
  rcuos are 8
  and the modprobe ppp_generic testcase reliably works, libvirt also manages
  to setup its bridge.
  
  Just with linux-tip , the rcuos are 6 but the failure is as reliable as
  before.

 
 Thank you, very interesting.  Which 6 of the rcuos are present?

Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this   
Phenom II.

 
  Awaiting instructions: :)
 
 Well, I thought I understood the problem until you found that only 6 of
 the expected 8 rcuos are present with linux-tip without the revert.  ;-)
 
 I am putting together a patch for the part of the problem that I think
 I understand, of course, but it would help a lot to know which two of
 the rcuos are missing.  ;-)


Ready to test

--Yanko


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
 On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
   
   On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
  
  [ . . . ]
  
 Ok, unless I've messed up something major, bisecting points to:
 
 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
 
 Makes any sense ?

Good question.  ;-)

Are any of your online CPUs missing rcuo kthreads?  There should be
kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU.
   
   It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted,
   the rcuos are 8
   and the modprobe ppp_generic testcase reliably works, libvirt also manages
   to setup its bridge.
   
   Just with linux-tip , the rcuos are 6 but the failure is as reliable as
   before.
 
  Thank you, very interesting.  Which 6 of the rcuos are present?
 
 Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this   
 Phenom II.

Ah, you get 8 without the patch because it creates them for potential
CPUs as well as real ones.  OK, got it.

 Awaiting instructions: :)
  
  Well, I thought I understood the problem until you found that only 6 of
  the expected 8 rcuos are present with linux-tip without the revert.  ;-)
  
  I am putting together a patch for the part of the problem that I think
  I understand, of course, but it would help a lot to know which two of
  the rcuos are missing.  ;-)
 
 Ready to test

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.

Thanx, Paul



diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 29fb23f33c18..927c17b081c7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
 		rdp->nocb_leader = rdp_spawn;
 		if (rdp_last && rdp != rdp_spawn)
 			rdp_last->nocb_next_follower = rdp;
-		rdp_last = rdp;
-		rdp = rdp->nocb_next_follower;
-		rdp_last->nocb_next_follower = NULL;
+		if (rdp == rdp_spawn) {
+			rdp = rdp->nocb_next_follower;
+		} else {
+			rdp_last = rdp;
+			rdp = rdp->nocb_next_follower;
+			rdp_last->nocb_next_follower = NULL;
+		}
 	} while (rdp);
 	rdp_spawn->nocb_next_follower = rdp_old_leader;
 }



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
  On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:

On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
   
   [ . . . ]
   
  Ok, unless I've messed up something major, bisecting points to:
  
  35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
  
  Makes any sense ?
 
 Good question.  ;-)
 
 Are any of your online CPUs missing rcuo kthreads?  There should be
 kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online 
 CPU.

It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted,
the rcuos are 8
and the modprobe ppp_generic testcase reliably works, libvirt also 
manages
to setup its bridge.

Just with linux-tip , the rcuos are 6 but the failure is as reliable as
before.
  
   Thank you, very interesting.  Which 6 of the rcuos are present?
  
  Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this   
  Phenom II.
 
 Ah, you get 8 without the patch because it creates them for potential
 CPUs as well as real ones.  OK, got it.
 
Awaiting instructions: :)
   
   Well, I thought I understood the problem until you found that only 6 of
   the expected 8 rcuos are present with linux-tip without the revert.  ;-)
   
   I am putting together a patch for the part of the problem that I think
   I understand, of course, but it would help a lot to know which two of
   the rcuos are missing.  ;-)
  
  Ready to test
 
 Well, if you are feeling aggressive, give the following patch a spin.
 I am doing sanity tests on it in the meantime.

Doesn't seem to make a difference here

 
   Thanx, Paul
 
 
 
 diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
 index 29fb23f33c18..927c17b081c7 100644
 --- a/kernel/rcu/tree_plugin.h
 +++ b/kernel/rcu/tree_plugin.h
 @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
  		rdp->nocb_leader = rdp_spawn;
  		if (rdp_last && rdp != rdp_spawn)
  			rdp_last->nocb_next_follower = rdp;
 -		rdp_last = rdp;
 -		rdp = rdp->nocb_next_follower;
 -		rdp_last->nocb_next_follower = NULL;
 +		if (rdp == rdp_spawn) {
 +			rdp = rdp->nocb_next_follower;
 +		} else {
 +			rdp_last = rdp;
 +			rdp = rdp->nocb_next_follower;
 +			rdp_last->nocb_next_follower = NULL;
 +		}
  	} while (rdp);
  	rdp_spawn->nocb_next_follower = rdp_old_leader;
  }
 


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
[...]
 Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
 __call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
 NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
 If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
 Make nocb leader kthreads process pending callbacks after spawning)
 and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
 
  Just a note to add that I am also reliably inducing what appears
 to be this issue on a current -net tree, when configuring openvswitch
 via script.  I am available to test patches or bisect tomorrow (Friday)
 US time if needed.

Thank you, Jay!  Could you please check to see if reverting this commit
fixes things for you?

35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs

Reverting is not a long-term fix, as this commit is itself a bug fix,
but would be good to check to see if you are seeing the same thing that
Yanko is.  ;-)

Just to confirm what Yanko found, reverting this commit makes
the problem go away for me.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
  wrote:

[ . . . ]

   Ok, unless I've messed up something major, bisecting points to:
   
   35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
   
   Makes any sense ?
  
  Good question.  ;-)
  
  Are any of your online CPUs missing rcuo kthreads?  There should be
  kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online 
  CPU.
 
 It's a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a
 reverted, the rcuos are 8
 and the modprobe ppp_generic testcase reliably works, libvirt also 
 manages
 to setup its bridge.
 
 Just with linux-tip , the rcuos are 6 but the failure is as reliable 
 as
 before.
   
Thank you, very interesting.  Which 6 of the rcuos are present?
   
   Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this 
 
   Phenom II.
  
  Ah, you get 8 without the patch because it creates them for potential
  CPUs as well as real ones.  OK, got it.
  
 Awaiting instructions: :)

Well, I thought I understood the problem until you found that only 6 of
the expected 8 rcuos are present with linux-tip without the revert.  ;-)

I am putting together a patch for the part of the problem that I think
I understand, of course, but it would help a lot to know which two of
the rcuos are missing.  ;-)
   
   Ready to test
  
  Well, if you are feeling aggressive, give the following patch a spin.
  I am doing sanity tests on it in the meantime.
 
 Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
the problem occurs, then dump the trace buffer after the problem occurs.

Thanx, Paul

  
  
  diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
  index 29fb23f33c18..927c17b081c7 100644
  --- a/kernel/rcu/tree_plugin.h
  +++ b/kernel/rcu/tree_plugin.h
  @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
   		rdp->nocb_leader = rdp_spawn;
   		if (rdp_last && rdp != rdp_spawn)
   			rdp_last->nocb_next_follower = rdp;
  -		rdp_last = rdp;
  -		rdp = rdp->nocb_next_follower;
  -		rdp_last->nocb_next_follower = NULL;
  +		if (rdp == rdp_spawn) {
  +			rdp = rdp->nocb_next_follower;
  +		} else {
  +			rdp_last = rdp;
  +			rdp = rdp->nocb_next_follower;
  +			rdp_last->nocb_next_follower = NULL;
  +		}
   	} while (rdp);
   	rdp_spawn->nocb_next_follower = rdp_old_leader;
   }
  
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:20:11AM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
  Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 [...]
  Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
  __call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
  NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
  If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
  Make nocb leader kthreads process pending callbacks after spawning)
  and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
  
 Just a note to add that I am also reliably inducing what appears
  to be this issue on a current -net tree, when configuring openvswitch
  via script.  I am available to test patches or bisect tomorrow (Friday)
  US time if needed.
 
 Thank you, Jay!  Could you please check to see if reverting this commit
 fixes things for you?
 
 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
 
 Reverting is not a long-term fix, as this commit is itself a bug fix,
 but would be good to check to see if you are seeing the same thing that
 Yanko is.  ;-)
 
   Just to confirm what Yanko found, reverting this commit makes
 the problem go away for me.

Thank you!

I take it that the patches that don't help Yanko also don't help you?

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
  wrote:

[ . . . ]

    Ok, unless I've messed up something major, bisecting points to:
   
   35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
   
    Makes any sense?
  
  Good question.  ;-)
  
  Are any of your online CPUs missing rcuo kthreads?  There should be
  kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
  online CPU.
 
 It's a Phenom II X6. With 3.17, and with linux-tip with 35ce7f29a44a
 reverted, the rcuos are 8, the modprobe ppp_generic test case reliably
 works, and libvirt also manages to set up its bridge.
 
 Just with linux-tip, the rcuos are 6, but the failure is as reliable as
 before.
   
Thank you, very interesting.  Which 6 of the rcuos are present?
   
   Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like
   this Phenom II.
  
  Ah, you get 8 without the patch because it creates them for potential
  CPUs as well as real ones.  OK, got it.
  
 Awaiting instructions :)

Well, I thought I understood the problem until you found that only 6 of
the expected 8 rcuos are present with linux-tip without the revert.  
;-)

I am putting together a patch for the part of the problem that I think
I understand, of course, but it would help a lot to know which two of
the rcuos are missing.  ;-)
   
   Ready to test
  
  Well, if you are feeling aggressive, give the following patch a spin.
  I am doing sanity tests on it in the meantime.
 
 Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
the problem occurs, then dump the trace buffer after the problem occurs.

My system is up and responsive when the problem occurs, so this
shouldn't be a problem.

Do you want the ftrace with your patch below, or unmodified tip
of tree?

-J


   Thanx, Paul

  
  
  diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
  index 29fb23f33c18..927c17b081c7 100644
  --- a/kernel/rcu/tree_plugin.h
  +++ b/kernel/rcu/tree_plugin.h
   @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
  	rdp->nocb_leader = rdp_spawn;
  	if (rdp_last && rdp != rdp_spawn)
  		rdp_last->nocb_next_follower = rdp;
   -	rdp_last = rdp;
   -	rdp = rdp->nocb_next_follower;
   -	rdp_last->nocb_next_follower = NULL;
   +	if (rdp == rdp_spawn) {
   +		rdp = rdp->nocb_next_follower;
   +	} else {
   +		rdp_last = rdp;
   +		rdp = rdp->nocb_next_follower;
   +		rdp_last->nocb_next_follower = NULL;
   +	}
  	} while (rdp);
  	rdp_spawn->nocb_next_follower = rdp_old_leader;
 }
  

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:49:48AM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
  
  On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
   wrote:
 
 [ . . . ]
 
 Ok, unless I've messed up something major, bisecting points to:

35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs

 Makes any sense?
   
   Good question.  ;-)
   
   Are any of your online CPUs missing rcuo kthreads?  There should 
   be
   kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
   online CPU.
  
  It's a Phenom II X6. With 3.17, and with linux-tip with 35ce7f29a44a
  reverted, the rcuos are 8, the modprobe ppp_generic test case reliably
  works, and libvirt also manages to set up its bridge.
  
  Just with linux-tip, the rcuos are 6, but the failure is as reliable as
  before.

 Thank you, very interesting.  Which 6 of the rcuos are present?

 Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like
 this Phenom II.
   
   Ah, you get 8 without the patch because it creates them for potential
   CPUs as well as real ones.  OK, got it.
   
  Awaiting instructions :)
 
 Well, I thought I understood the problem until you found that only 6 
 of
 the expected 8 rcuos are present with linux-tip without the revert.  
 ;-)
 
 I am putting together a patch for the part of the problem that I 
 think
 I understand, of course, but it would help a lot to know which two of
 the rcuos are missing.  ;-)

Ready to test
   
   Well, if you are feeling aggressive, give the following patch a spin.
   I am doing sanity tests on it in the meantime.
  
  Doesn't seem to make a difference here
 
 OK, inspection isn't cutting it, so time for tracing.  Does the system
 respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
 the problem occurs, then dump the trace buffer after the problem occurs.
 
   My system is up and responsive when the problem occurs, so this
 shouldn't be a problem.

Nice!  ;-)

   Do you want the ftrace with your patch below, or unmodified tip
 of tree?

Let's please start with the patch.

Thanx, Paul

   -J
 
 
  Thanx, Paul
 
   
   
   diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
   index 29fb23f33c18..927c17b081c7 100644
   --- a/kernel/rcu/tree_plugin.h
   +++ b/kernel/rcu/tree_plugin.h
   @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
 	rdp->nocb_leader = rdp_spawn;
 	if (rdp_last && rdp != rdp_spawn)
 		rdp_last->nocb_next_follower = rdp;
   -	rdp_last = rdp;
   -	rdp = rdp->nocb_next_follower;
   -	rdp_last->nocb_next_follower = NULL;
   +	if (rdp == rdp_spawn) {
   +		rdp = rdp->nocb_next_follower;
   +	} else {
   +		rdp_last = rdp;
   +		rdp = rdp->nocb_next_follower;
   +		rdp_last->nocb_next_follower = NULL;
   +	}
 	} while (rdp);
 	rdp_spawn->nocb_next_follower = rdp_old_leader;
}
   
 
 ---
   -Jay Vosburgh, jay.vosbu...@canonical.com
 



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 11:57:53AM -0700, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 11:49:48AM -0700, Jay Vosburgh wrote:
  Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  
  On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
 On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti 
  wrote:
   
   On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney 
   wrote:
On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
wrote:
  
  [ . . . ]
  
  Ok, unless I've messed up something major, bisecting points to:
 
 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
 
  Makes any sense?

Good question.  ;-)

Are any of your online CPUs missing rcuo kthreads?  There 
should be
kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
online CPU.
   
    It's a Phenom II X6. With 3.17, and with linux-tip with 35ce7f29a44a
    reverted, the rcuos are 8, the modprobe ppp_generic test case reliably
    works, and libvirt also manages to set up its bridge.
    
    Just with linux-tip, the rcuos are 6, but the failure is as reliable as
    before.
 
  Thank you, very interesting.  Which 6 of the rcuos are present?
 
  Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like
  this Phenom II.

Ah, you get 8 without the patch because it creates them for potential
CPUs as well as real ones.  OK, got it.

    Awaiting instructions :)
  
  Well, I thought I understood the problem until you found that only 
  6 of
  the expected 8 rcuos are present with linux-tip without the 
  revert.  ;-)
  
  I am putting together a patch for the part of the problem that I 
  think
  I understand, of course, but it would help a lot to know which two 
  of
  the rcuos are missing.  ;-)
 
 Ready to test

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.
   
   Doesn't seem to make a difference here
  
  OK, inspection isn't cutting it, so time for tracing.  Does the system
  respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
  the problem occurs, then dump the trace buffer after the problem occurs.
  
  My system is up and responsive when the problem occurs, so this
  shouldn't be a problem.
 
 Nice!  ;-)
 
  Do you want the ftrace with your patch below, or unmodified tip
  of tree?
 
 Let's please start with the patch.

And I should hasten to add that you need to set CONFIG_RCU_TRACE=y
for these tracepoints to be enabled.
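For reference, a quick way to check or flip that option in a kernel build tree (a sketch; the scripts/config helper ships with the kernel source, but menuconfig works just as well):

```shell
# From the top of the kernel source tree:
grep CONFIG_RCU_TRACE .config                  # see whether the option is already set
./scripts/config --enable CONFIG_RCU_TRACE     # enable it, then rebuild and reboot
```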

Thanx, Paul

  -J
  
  
 Thanx, Paul
  


diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 29fb23f33c18..927c17b081c7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
 	rdp->nocb_leader = rdp_spawn;
 	if (rdp_last && rdp != rdp_spawn)
 		rdp_last->nocb_next_follower = rdp;
-	rdp_last = rdp;
-	rdp = rdp->nocb_next_follower;
-	rdp_last->nocb_next_follower = NULL;
+	if (rdp == rdp_spawn) {
+		rdp = rdp->nocb_next_follower;
+	} else {
+		rdp_last = rdp;
+		rdp = rdp->nocb_next_follower;
+		rdp_last->nocb_next_follower = NULL;
+	}
 	} while (rdp);
 	rdp_spawn->nocb_next_follower = rdp_old_leader;
   }

  
  ---
  -Jay Vosburgh, jay.vosbu...@canonical.com
  



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Yanko Kaneti
On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
  
  On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
   wrote:
 
 [ . . . ]
 
Ok, unless I've messed up something major, bisecting points to:

35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs

Makes any sense?
   
   Good question.  ;-)
   
   Are any of your online CPUs missing rcuo kthreads?  There should 
   be
   kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
   online CPU.
  
  It's a Phenom II X6. With 3.17, and with linux-tip with 35ce7f29a44a
  reverted, the rcuos are 8, the modprobe ppp_generic test case reliably
  works, and libvirt also manages to set up its bridge.
  
  Just with linux-tip, the rcuos are 6, but the failure is as reliable as
  before.

 Thank you, very interesting.  Which 6 of the rcuos are present?

Well, the rcuos are 0 to 5, which sounds right for a 6-core CPU like
this Phenom II.
   
   Ah, you get 8 without the patch because it creates them for potential
   CPUs as well as real ones.  OK, got it.
   
 Awaiting instructions :)
 
 Well, I thought I understood the problem until you found that only 6 
 of
 the expected 8 rcuos are present with linux-tip without the revert.  
 ;-)
 
 I am putting together a patch for the part of the problem that I think
 I understand, of course, but it would help a lot to know which two of
 the rcuos are missing.  ;-)

Ready to test
   
   Well, if you are feeling aggressive, give the following patch a spin.
   I am doing sanity tests on it in the meantime.
  
  Doesn't seem to make a difference here
 
 OK, inspection isn't cutting it, so time for tracing.  Does the system
 respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
 the problem occurs, then dump the trace buffer after the problem occurs.

Sorry for being unresponsive here, but I know next to nothing about tracing
or most things about the kernel, so I have some catching up to do.

In the meantime some layman observations while I tried to find what exactly
triggers the problem.
- Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
- libvirtd seems to be very active in using all sorts of kernel facilities
  that are modules on fedora so it seems to cause many simultaneous kworker 
  calls to modprobe
- there are 8 kworker/u16 from 0 to 7
- one of these kworkers always deadlocks, while there appear to be two
  kworker/u16:6 - the seventh

  6 vs 8 as in 6 rcuos where before they were always 8

Just observations from someone who still doesn't know what the u16
kworkers are..

-- Yanko



 
   Thanx, Paul
 
   
   
   diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
   index 29fb23f33c18..927c17b081c7 100644
   --- a/kernel/rcu/tree_plugin.h
   +++ b/kernel/rcu/tree_plugin.h
   @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
 	rdp->nocb_leader = rdp_spawn;
 	if (rdp_last && rdp != rdp_spawn)
 		rdp_last->nocb_next_follower = rdp;
   -	rdp_last = rdp;
   -	rdp = rdp->nocb_next_follower;
   -	rdp_last->nocb_next_follower = NULL;
   +	if (rdp == rdp_spawn) {
   +		rdp = rdp->nocb_next_follower;
   +	} else {
   +		rdp_last = rdp;
   +		rdp = rdp->nocb_next_follower;
   +		rdp_last->nocb_next_follower = NULL;
   +	}
 	} while (rdp);
 	rdp_spawn->nocb_next_follower = rdp_old_leader;
 }
   
  
 


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:

[ . . . ]

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.
   
   Doesn't seem to make a difference here
  
  OK, inspection isn't cutting it, so time for tracing.  Does the system
  respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
  the problem occurs, then dump the trace buffer after the problem occurs.
 
 Sorry for being unresponsive here, but I know next to nothing about tracing
 or most things about the kernel, so I have some catching up to do.
 
 In the meantime some layman observations while I tried to find what exactly
 triggers the problem.
 - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
 - libvirtd seems to be very active in using all sorts of kernel facilities
   that are modules on fedora so it seems to cause many simultaneous kworker 
   calls to modprobe
 - there are 8 kworker/u16 from 0 to 7
 - one of these kworkers always deadlocks, while there appear to be two
   kworker/u16:6 - the seventh

Adding Tejun on CC in case this duplication of kworker/u16:6 is important.

   6 vs 8 as in 6 rcuos where before they were always 8
 
 Just observations from someone who still doesn't know what the u16
 kworkers are..

Could you please run the following diagnostic patch?  This will help
me see if I have managed to miswire the rcuo kthreads.  It should
print some information at task-hang time.

Thanx, Paul



rcu: Dump no-CBs CPU state at task-hung time

Strictly diagnostic commit for rcu_barrier() hang.  Not for inclusion.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 0e5366200154..34048140577b 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -157,4 +157,8 @@ static inline bool rcu_is_watching(void)
 
 #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
 
+static inline void rcu_show_nocb_setup(void)
+{
+}
+
 #endif /* __LINUX_RCUTINY_H */
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 52953790dcca..0b813bdb971b 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
 
 bool rcu_is_watching(void);
 
+void rcu_show_nocb_setup(void);
+
 #endif /* __LINUX_RCUTREE_H */
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 06db12434d72..e6e4d0f6b063 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -118,6 +118,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 			" disables this message.\n");
 	sched_show_task(t);
 	debug_show_held_locks(t);
+	rcu_show_nocb_setup();
 
 	touch_nmi_watchdog();
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 240fa9094f83..6b373e79ce0e 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void)
 {
int i;
 
+   rcu_show_nocb_setup();
rcutorture_record_test_transition();
if (torture_cleanup_begin()) {
 		if (cur_ops->cb_barrier != NULL)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 927c17b081c7..285b3f6fb229 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
 
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
+void rcu_show_nocb_setup(void)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+   int cpu;
+   struct rcu_data *rdp;
+   struct rcu_state *rsp;
+
+	for_each_rcu_flavor(rsp) {
+		pr_alert("rcu_show_nocb_setup(): %s nocb state:\n", rsp->name);
+		for_each_possible_cpu(cpu) {
+			if (!rcu_is_nocb_cpu(cpu))
+				continue;
+			rdp = per_cpu_ptr(rsp->rda, cpu);
+			pr_alert("%3d: %p l:%p n:%p %c%c%c\n",
+				 cpu,
+				 rdp, rdp->nocb_leader, rdp->nocb_next_follower,
+				 ".N"[!!rdp->nocb_head],
+				 ".G"[!!rdp->nocb_gp_head],
+				 ".F"[!!rdp->nocb_follower_head]);
+		}
+	}
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+}
+EXPORT_SYMBOL_GPL(rcu_show_nocb_setup);
+
 /*
  * An adaptive-ticks CPU can potentially execute in kernel mode for an
  * arbitrarily long period of time with the scheduling-clock tick turned


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:

[ . . . ]

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.
   
   Doesn't seem to make a difference here
  
  OK, inspection isn't cutting it, so time for tracing.  Does the system
  respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
  the problem occurs, then dump the trace buffer after the problem occurs.
 
 Sorry for being unresponsive here, but I know next to nothing about tracing
 or most things about the kernel, so I have some catching up to do.
 
 In the meantime some layman observations while I tried to find what exactly
 triggers the problem.
 - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
 - libvirtd seems to be very active in using all sorts of kernel facilities
   that are modules on fedora so it seems to cause many simultaneous kworker 
   calls to modprobe
 - there are 8 kworker/u16 from 0 to 7
 - one of these kworkers always deadlocks, while there appear to be two
   kworker/u16:6 - the seventh

Adding Tejun on CC in case this duplication of kworker/u16:6 is important.

   6 vs 8 as in 6 rcuos where before they were always 8
 
 Just observations from someone who still doesn't know what the u16
 kworkers are..

Could you please run the following diagnostic patch?  This will help
me see if I have managed to miswire the rcuo kthreads.  It should
print some information at task-hang time.

I can give this a spin after the ftrace (now that I've got
CONFIG_RCU_TRACE turned on).

I've got an ftrace capture from unmodified -net, it looks like
this:

ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 cpu 
-1 remaining 0 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 cpu 
-1 remaining 4 # 2
 rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu 
-1 remaining 3 # 2
 rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu 
-1 remaining 2 # 2

I let it sit through several hung task cycles but that was all
there was for rcu:rcu_barrier.

I should have ftrace with the patch as soon as the kernel is
done building, then I can try the below patch (I'll start it building
now).

-J




   Thanx, Paul



rcu: Dump no-CBs CPU state at task-hung time

Strictly diagnostic commit for rcu_barrier() hang.  Not for inclusion.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 0e5366200154..34048140577b 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -157,4 +157,8 @@ static inline bool rcu_is_watching(void)
 
 #endif /* #else defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) */
 
+static inline void rcu_show_nocb_setup(void)
+{
+}
+
 #endif /* __LINUX_RCUTINY_H */
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 52953790dcca..0b813bdb971b 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -97,4 +97,6 @@ extern int rcu_scheduler_active __read_mostly;
 
 bool rcu_is_watching(void);
 
+void rcu_show_nocb_setup(void);
+
 #endif /* __LINUX_RCUTREE_H */
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 06db12434d72..e6e4d0f6b063 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -118,6 +118,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 			" disables this message.\n");
 	sched_show_task(t);
 	debug_show_held_locks(t);
+	rcu_show_nocb_setup();
 
 	touch_nmi_watchdog();
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 240fa9094f83..6b373e79ce0e 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void)
 {
   int i;
 
+  rcu_show_nocb_setup();
   rcutorture_record_test_transition();
   if (torture_cleanup_begin()) {

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
 
 [ . . . ]
 
 Well, if you are feeling aggressive, give the following patch a spin.
 I am doing sanity tests on it in the meantime.

Doesn't seem to make a difference here
   
   OK, inspection isn't cutting it, so time for tracing.  Does the system
   respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
   before
   the problem occurs, then dump the trace buffer after the problem occurs.
  
  Sorry for being unresponsive here, but I know next to nothing about tracing
  or most things about the kernel, so I have some catching up to do.
  
  In the meantime some layman observations while I tried to find what exactly
  triggers the problem.
  - Even in runlevel 1 I can reliably trigger the problem by starting 
  libvirtd
  - libvirtd seems to be very active in using all sorts of kernel facilities
that are modules on fedora so it seems to cause many simultaneous 
  kworker 
calls to modprobe
  - there are 8 kworker/u16 from 0 to 7
  - one of these kworkers always deadlocks, while there appear to be two
kworker/u16:6 - the seventh
 
 Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
 
6 vs 8 as in 6 rcuos where before they were always 8
  
  Just observations from someone who still doesn't know what the u16
  kworkers are..
 
 Could you please run the following diagnostic patch?  This will help
 me see if I have managed to miswire the rcuo kthreads.  It should
 print some information at task-hang time.
 
   I can give this a spin after the ftrace (now that I've got
 CONFIG_RCU_TRACE turned on).
 
   I've got an ftrace capture from unmodified -net, it looks like
 this:
 
 ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
 cpu -1 remaining 0 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 0 remaining 1 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 1 remaining 2 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 2 remaining 3 # 1
 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 3 remaining 4 # 1

OK, so it looks like your system has four CPUs, and rcu_barrier() placed
callbacks on them all.

 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
 cpu -1 remaining 4 # 2

The above removes the extra count used to avoid races between posting new
callbacks and completion of previously posted callbacks.

  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu 
 -1 remaining 3 # 2
  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu 
 -1 remaining 2 # 2

Two of the four callbacks fired, but the other two appear to be AWOL.
And rcu_barrier() won't return until they all fire.

   I let it sit through several hung task cycles but that was all
 there was for rcu:rcu_barrier.
 
   I should have ftrace with the patch as soon as the kernel is
 done building, then I can try the below patch (I'll start it building
 now).

Sounds very good, looking forward to hearing of the results.

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:

[ . . . ]

Well, if you are feeling aggressive, give the following patch a spin.
I am doing sanity tests on it in the meantime.
   
   Doesn't seem to make a difference here
  
  OK, inspection isn't cutting it, so time for tracing.  Does the system
  respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
  the problem occurs, then dump the trace buffer after the problem occurs.
 
 Sorry for being unresponsive here, but I know next to nothing about tracing
 or most things about the kernel, so I have some catching up to do.
 
 In the meantime some layman observations while I tried to find what exactly
 triggers the problem.
 - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
 - libvirtd seems to be very active in using all sorts of kernel facilities
   that are modules on fedora so it seems to cause many simultaneous kworker 
   calls to modprobe
 - there are 8 kworker/u16 from 0 to 7
 - one of these kworkers always deadlocks, while there appear to be two
   kworker/u16:6 - the seventh

Adding Tejun on CC in case this duplication of kworker/u16:6 is important.

   6 vs 8 as in 6 rcuos where before they were always 8
 
 Just observations from someone who still doesn't know what the u16
 kworkers are..

Could you please run the following diagnostic patch?  This will help
me see if I have managed to miswire the rcuo kthreads.  It should
print some information at task-hang time.

Here's the output of the patch; I let it sit through two hang
cycles.

-J


[  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  240.354878]   Not tainted 3.17.0-testola+ #4
[  240.359481] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[  240.367285] ovs-vswitchdD 88013fc94600 0   902901 0x0004
[  240.367290]  8800ab20f7b8 0002 8800b3304b00 
8800ab20ffd8
[  240.367293]  00014600 00014600 8800b081 
8800b3304b00
[  240.367296]  8800b3304b00 81c59850 81c59858 
7fff
[  240.367300] Call Trace:
[  240.367307]  [81722b99] schedule+0x29/0x70
[  240.367310]  [81725b6c] schedule_timeout+0x1dc/0x260
[  240.367313]  [81722f69] ? _cond_resched+0x29/0x40
[  240.367316]  [81723818] ? wait_for_completion+0x28/0x160
[  240.367321]  [811081a7] ? queue_stop_cpus_work+0xc7/0xe0
[  240.367324]  [81723896] wait_for_completion+0xa6/0x160
[  240.367328]  [81099980] ? wake_up_state+0x20/0x20
[  240.367331]  [810d0ecc] _rcu_barrier+0x20c/0x480
[  240.367334]  [810d1195] rcu_barrier+0x15/0x20
[  240.367338]  [81625010] netdev_run_todo+0x60/0x300
[  240.367341]  [8162f9ee] rtnl_unlock+0xe/0x10
[  240.367349]  [a01ffcc5] internal_dev_destroy+0x55/0x80 
[openvswitch]
[  240.367354]  [a01ff622] ovs_vport_del+0x32/0x40 [openvswitch]
[  240.367358]  [a01f8dd0] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  240.367363]  [a01f8ea5] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  240.367367]  [81651d75] genl_family_rcv_msg+0x1a5/0x3c0
[  240.367370]  [81651f90] ? genl_family_rcv_msg+0x3c0/0x3c0
[  240.367372]  [81652021] genl_rcv_msg+0x91/0xd0
[  240.367376]  [81650091] netlink_rcv_skb+0xc1/0xe0
[  240.367378]  [816505bc] genl_rcv+0x2c/0x40
[  240.367381]  [8164f626] netlink_unicast+0xf6/0x200
[  240.367383]  [8164fa4d] netlink_sendmsg+0x31d/0x780
[  240.367387]  [8164ca74] ? netlink_rcv_wake+0x44/0x60
[  240.367391]  [81606a53] sock_sendmsg+0x93/0xd0
[  240.367395]  [81337700] ? apparmor_capable+0x60/0x60
[  240.367399]  [81614f27] ? verify_iovec+0x47/0xd0
[  240.367402]  [81606e79] ___sys_sendmsg+0x399/0x3b0
[  240.367406]  [812598a2] ? kernfs_seq_stop_active+0x32/0x40
[  240.367410]  [8101c385] ? native_sched_clock+0x35/0x90
[  240.367413]  [8101c385] ? native_sched_clock+0x35/0x90
[  240.367416]  [8101c3e9] ? sched_clock+0x9/0x10
[  240.367420]  [811277fc] ? acct_account_cputime+0x1c/0x20
[  240.367424]  [8109ce6b] ? account_user_time+0x8b/0xa0
[  240.367428]  [81200bd5] ? __fget_light+0x25/0x70
[  240.367431]  [81607c02] __sys_sendmsg+0x42/0x80
[  240.367433]  [81607c52] SyS_sendmsg+0x12/0x20
[  240.367436]  [81727464] tracesys_phase2+0xd8/0xdd
[  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
[  240.372734]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 .G.
[  240.379673]   1: 88013fc8e600 l:88013fc0e600 n:  (null) .G.
[  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
[...]
  I've got an ftrace capture from unmodified -net, it looks like
 this:
 
 ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
 cpu -1 remaining 0 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 0 remaining 1 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 1 remaining 2 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 2 remaining 3 # 1
 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 3 remaining 4 # 1

OK, so it looks like your system has four CPUs, and rcu_barrier() placed
callbacks on them all.

No, the system has only two CPUs.  It's an Intel Core 2 Duo
E8400, and /proc/cpuinfo agrees that there are only 2.  There is a
potentially relevant-sounding message early in dmesg that says:

[0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs

 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
 cpu -1 remaining 4 # 2

The above removes the extra count used to avoid races between posting new
callbacks and completion of previously posted callbacks.

  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB 
 cpu -1 remaining 3 # 2
  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB 
 cpu -1 remaining 2 # 2

Two of the four callbacks fired, but the other two appear to be AWOL.
And rcu_barrier() won't return until they all fire.

  I let it sit through several hung task cycles but that was all
 there was for rcu:rcu_barrier.
 
  I should have ftrace with the patch as soon as the kernel is
 done building, then I can try the below patch (I'll start it building
 now).

Sounds very good, looking forward to hearing of the results.

Going to bounce it for ftrace now, but the cpu count mismatch
seemed important enough to mention separately.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
  On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
   On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
 
 [ . . . ]
 
 Well, if you are feeling aggressive, give the following patch a spin.
 I am doing sanity tests on it in the meantime.

Doesn't seem to make a difference here
   
   OK, inspection isn't cutting it, so time for tracing.  Does the system
   respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
   before
   the problem occurs, then dump the trace buffer after the problem occurs.
  
  Sorry for being unresponsive here, but I know next to nothing about tracing
  or most things about the kernel, so I have some catching up to do.
  
  In the meantime some layman observations while I tried to find what exactly
  triggers the problem.
  - Even in runlevel 1 I can reliably trigger the problem by starting 
  libvirtd
  - libvirtd seems to be very active in using all sorts of kernel facilities
that are modules on fedora so it seems to cause many simultaneous 
  kworker 
calls to modprobe
  - there are 8 kworker/u16 from 0 to 7
  - one of these kworkers always deadlocks, while there appear to be two
kworker/u16:6 - the seventh
 
 Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
 
6 vs 8 as in 6 rcuos where before they were always 8
  
  Just observations from someone who still doesn't know what the u16
  kworkers are..
 
 Could you please run the following diagnostic patch?  This will help
 me see if I have managed to miswire the rcuo kthreads.  It should
 print some information at task-hang time.
 
   Here's the output of the patch; I let it sit through two hang
 cycles.
 
   -J
 
 
 [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
 [  240.354878]   Not tainted 3.17.0-testola+ #4
 [  240.359481] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 [  240.367285] ovs-vswitchdD 88013fc94600 0   902901 
 0x0004
 [  240.367290]  8800ab20f7b8 0002 8800b3304b00 
 8800ab20ffd8
 [  240.367293]  00014600 00014600 8800b081 
 8800b3304b00
 [  240.367296]  8800b3304b00 81c59850 81c59858 
 7fff
 [  240.367300] Call Trace:
 [  240.367307]  [81722b99] schedule+0x29/0x70
 [  240.367310]  [81725b6c] schedule_timeout+0x1dc/0x260
 [  240.367313]  [81722f69] ? _cond_resched+0x29/0x40
 [  240.367316]  [81723818] ? wait_for_completion+0x28/0x160
 [  240.367321]  [811081a7] ? queue_stop_cpus_work+0xc7/0xe0
 [  240.367324]  [81723896] wait_for_completion+0xa6/0x160
 [  240.367328]  [81099980] ? wake_up_state+0x20/0x20
 [  240.367331]  [810d0ecc] _rcu_barrier+0x20c/0x480
 [  240.367334]  [810d1195] rcu_barrier+0x15/0x20
 [  240.367338]  [81625010] netdev_run_todo+0x60/0x300
 [  240.367341]  [8162f9ee] rtnl_unlock+0xe/0x10
 [  240.367349]  [a01ffcc5] internal_dev_destroy+0x55/0x80 
 [openvswitch]
 [  240.367354]  [a01ff622] ovs_vport_del+0x32/0x40 [openvswitch]
 [  240.367358]  [a01f8dd0] ovs_dp_detach_port+0x30/0x40 
 [openvswitch]
 [  240.367363]  [a01f8ea5] ovs_vport_cmd_del+0xc5/0x110 
 [openvswitch]
 [  240.367367]  [81651d75] genl_family_rcv_msg+0x1a5/0x3c0
 [  240.367370]  [81651f90] ? genl_family_rcv_msg+0x3c0/0x3c0
 [  240.367372]  [81652021] genl_rcv_msg+0x91/0xd0
 [  240.367376]  [81650091] netlink_rcv_skb+0xc1/0xe0
 [  240.367378]  [816505bc] genl_rcv+0x2c/0x40
 [  240.367381]  [8164f626] netlink_unicast+0xf6/0x200
 [  240.367383]  [8164fa4d] netlink_sendmsg+0x31d/0x780
 [  240.367387]  [8164ca74] ? netlink_rcv_wake+0x44/0x60
 [  240.367391]  [81606a53] sock_sendmsg+0x93/0xd0
 [  240.367395]  [81337700] ? apparmor_capable+0x60/0x60
 [  240.367399]  [81614f27] ? verify_iovec+0x47/0xd0
 [  240.367402]  [81606e79] ___sys_sendmsg+0x399/0x3b0
 [  240.367406]  [812598a2] ? kernfs_seq_stop_active+0x32/0x40
 [  240.367410]  [8101c385] ? native_sched_clock+0x35/0x90
 [  240.367413]  [8101c385] ? native_sched_clock+0x35/0x90
 [  240.367416]  [8101c3e9] ? sched_clock+0x9/0x10
 [  240.367420]  [811277fc] ? acct_account_cputime+0x1c/0x20
 [  240.367424]  [8109ce6b] ? account_user_time+0x8b/0xa0
 [  240.367428]  [81200bd5] ? __fget_light+0x25/0x70
 [  240.367431]  [81607c02] __sys_sendmsg+0x42/0x80
 [  240.367433]  [81607c52] SyS_sendmsg+0x12/0x20
 [  240.367436]  [81727464] tracesys_phase2+0xd8/0xdd
 [  240.367439] rcu_show_nocb_setup(): rcu_sched nocb 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
 On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
  Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  
  On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
  
  [ . . . ]
  
  Well, if you are feeling aggressive, give the following patch a 
  spin.
  I am doing sanity tests on it in the meantime.
 
 Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace 
before
the problem occurs, then dump the trace buffer after the problem 
occurs.
   
   Sorry for being unresponsive here, but I know next to nothing about 
   tracing
   or most things about the kernel, so I have some catching up to do.
   
   In the meantime some layman observations while I tried to find what 
   exactly
   triggers the problem.
   - Even in runlevel 1 I can reliably trigger the problem by starting 
   libvirtd
   - libvirtd seems to be very active in using all sorts of kernel 
   facilities
 that are modules on fedora so it seems to cause many simultaneous 
   kworker 
 calls to modprobe
   - there are 8 kworker/u16 from 0 to 7
   - one of these kworkers always deadlocks, while there appear to be two
 kworker/u16:6 - the seventh
  
  Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
  
 6 vs 8 as in 6 rcuos where before they were always 8
   
   Just observations from someone who still doesn't know what the u16
   kworkers are..
  
  Could you please run the following diagnostic patch?  This will help
  me see if I have managed to miswire the rcuo kthreads.  It should
  print some information at task-hang time.
  
  Here's the output of the patch; I let it sit through two hang
  cycles.
  
  -J
  
  
  [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 
  seconds.
  [  240.354878]   Not tainted 3.17.0-testola+ #4
  [  240.359481] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
  [  240.367285] ovs-vswitchdD 88013fc94600 0   902901 
  0x0004
  [  240.367290]  8800ab20f7b8 0002 8800b3304b00 
  8800ab20ffd8
  [  240.367293]  00014600 00014600 8800b081 
  8800b3304b00
  [  240.367296]  8800b3304b00 81c59850 81c59858 
  7fff
  [  240.367300] Call Trace:
  [  240.367307]  [81722b99] schedule+0x29/0x70
  [  240.367310]  [81725b6c] schedule_timeout+0x1dc/0x260
  [  240.367313]  [81722f69] ? _cond_resched+0x29/0x40
  [  240.367316]  [81723818] ? wait_for_completion+0x28/0x160
  [  240.367321]  [811081a7] ? queue_stop_cpus_work+0xc7/0xe0
  [  240.367324]  [81723896] wait_for_completion+0xa6/0x160
  [  240.367328]  [81099980] ? wake_up_state+0x20/0x20
  [  240.367331]  [810d0ecc] _rcu_barrier+0x20c/0x480
  [  240.367334]  [810d1195] rcu_barrier+0x15/0x20
  [  240.367338]  [81625010] netdev_run_todo+0x60/0x300
  [  240.367341]  [8162f9ee] rtnl_unlock+0xe/0x10
  [  240.367349]  [a01ffcc5] internal_dev_destroy+0x55/0x80 
  [openvswitch]
  [  240.367354]  [a01ff622] ovs_vport_del+0x32/0x40 [openvswitch]
  [  240.367358]  [a01f8dd0] ovs_dp_detach_port+0x30/0x40 
  [openvswitch]
  [  240.367363]  [a01f8ea5] ovs_vport_cmd_del+0xc5/0x110 
  [openvswitch]
  [  240.367367]  [81651d75] genl_family_rcv_msg+0x1a5/0x3c0
  [  240.367370]  [81651f90] ? genl_family_rcv_msg+0x3c0/0x3c0
  [  240.367372]  [81652021] genl_rcv_msg+0x91/0xd0
  [  240.367376]  [81650091] netlink_rcv_skb+0xc1/0xe0
  [  240.367378]  [816505bc] genl_rcv+0x2c/0x40
  [  240.367381]  [8164f626] netlink_unicast+0xf6/0x200
  [  240.367383]  [8164fa4d] netlink_sendmsg+0x31d/0x780
  [  240.367387]  [8164ca74] ? netlink_rcv_wake+0x44/0x60
  [  240.367391]  [81606a53] sock_sendmsg+0x93/0xd0
  [  240.367395]  [81337700] ? apparmor_capable+0x60/0x60
  [  240.367399]  [81614f27] ? verify_iovec+0x47/0xd0
  [  240.367402]  [81606e79] ___sys_sendmsg+0x399/0x3b0
  [  240.367406]  [812598a2] ? kernfs_seq_stop_active+0x32/0x40
  [  240.367410]  [8101c385] ? native_sched_clock+0x35/0x90
  [  240.367413]  [8101c385] ? native_sched_clock+0x35/0x90
  [  240.367416]  [8101c3e9] ? sched_clock+0x9/0x10
  [  240.367420]  [811277fc] ? acct_account_cputime+0x1c/0x20
  [  240.367424]  [8109ce6b] ? account_user_time+0x8b/0xa0
  [  240.367428]  [81200bd5] ? __fget_light+0x25/0x70
  [  240.367431]  [81607c02] 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
[...]
 Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
 need to take a hard look at the sleep/wakeup code.
 
 Thank you for running this!!!

Could you please try the following patch?  If no joy, could you please
add rcu:rcu_nocb_wake to the list of ftrace events?

I tried the patch, it did not change the behavior.

I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
and ran it again (with this patch and the first patch from earlier
today); the trace output is a bit on the large side so I put it and the
dmesg log at:

http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt

http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

-J


   Thanx, Paul



rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..f6880052b917 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
   case CPU_DEAD_FROZEN:
   case CPU_UP_CANCELED:
   case CPU_UP_CANCELED_FROZEN:
-  for_each_rcu_flavor(rsp)
+  for_each_rcu_flavor(rsp) {
   rcu_cleanup_dead_cpu(cpu, rsp);
+  do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu));
+  }
   break;
   default:
   break;


---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
 [...]
  Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
  need to take a hard look at the sleep/wakeup code.
  
  Thank you for running this!!!
 
 Could you please try the following patch?  If no joy, could you please
 add rcu:rcu_nocb_wake to the list of ftrace events?
 
   I tried the patch, it did not change the behavior.
 
   I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
 and ran it again (with this patch and the first patch from earlier
 today); the trace output is a bit on the large side so I put it and the
 dmesg log at:
 
 http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
 
 http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

Thank you again!

Very strange part of the trace.  The only sign of CPU 2 and 3 are:

ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 cpu 
-1 remaining 0 # 1
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
WakeNot
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
WakeNot
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
WakeNotPoll
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
WakeNotPoll
ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 cpu 
-1 remaining 4 # 2

The pair of WakeNotPoll trace entries says that at that point, RCU believed
that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

More diagnostics in order...

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Jay Vosburgh
Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
>> 
>> On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
>> > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
>> > > On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
>> > > > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
>> > > > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
>> > > > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
>> > > > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
>> > > > > > > wrote:
>> > > > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
>> > > > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
>> > > > > > > > >  wrote:
>> > > > > > > 
>> > > > > > > [ . . . ]
>> > > > > > > 
>> > > > > > > > > > Don't get me wrong -- the fact that this kthread 
>> > > > > > > > > > appears to
>> > > > > > > > > > have
>> > > > > > > > > > blocked within rcu_barrier() for 120 seconds means 
>> > > > > > > > > > that
>> > > > > > > > > > something is
>> > > > > > > > > > most definitely wrong here.  I am surprised that 
>> > > > > > > > > > there are no
>> > > > > > > > > > RCU CPU
>> > > > > > > > > > stall warnings, but perhaps the blockage is in the 
>> > > > > > > > > > callback
>> > > > > > > > > > execution
>> > > > > > > > > > rather than grace-period completion.  Or something is
>> > > > > > > > > > preventing this
>> > > > > > > > > > kthread from starting up after the wake-up callback 
>> > > > > > > > > > executes.
>> > > > > > > > > > Or...
>> > > > > > > > > > 
>> > > > > > > > > > Is this thing reproducible?
>> > > > > > > > > 
>> > > > > > > > > I've added Yanko on CC, who reported the backtrace 
>> > > > > > > > > above and can
>> > > > > > > > > recreate it reliably.  Apparently reverting the RCU 
>> > > > > > > > > merge commit
>> > > > > > > > > (d6dd50e) and rebuilding the latest after that does 
>> > > > > > > > > not show the
>> > > > > > > > > issue.  I'll let Yanko explain more and answer any 
>> > > > > > > > > questions you
>> > > > > > > > > have.
>> > > > > > > > 
>> > > > > > > > - It is reproducible
>> > > > > > > > - I've done another build here to double check and it's 
>> > > > > > > > definitely
>> > > > > > > > the rcu merge
>> > > > > > > >   that's causing it.
>> > > > > > > > 
>> > > > > > > > Don't think I'll be able to dig deeper, but I can do 
>> > > > > > > > testing if
>> > > > > > > > needed.
>> > > > > > > 
>> > > > > > > Please!  Does the following patch help?
>> > > > > > 
>> > > > > > Nope, doesn't seem to make a difference to the modprobe 
>> > > > > > ppp_generic
>> > > > > > test
>> > > > > 
>> > > > > Well, I was hoping.  I will take a closer look at the RCU 
>> > > > > merge commit
>> > > > > and see what suggests itself.  I am likely to ask you to 
>> > > > > revert specific
>> > > > > commits, if that works for you.
>> > > > 
>> > > > Well, rather than reverting commits, could you please try 
>> > > > testing the
>> > > > following commits?
>> > > > 
>> > > > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
>> > > > callbacks after spawning)
>> > > > 
>> > > > 73a860cd58a1 (rcu: Replace flush_signals() with 
>> > > > WARN_ON(signal_pending()))
>> > > > 
>> > > > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
>> > > > 
>> > > > For whatever it is worth, I am guessing this one.
>> > > 
>> > > Indeed, c847f14217d5 it is.
>> > > 
>> > > Much to my embarrassment I just noticed that in addition to the
>> > > rcu merge, triggering the bug "requires" my specific Fedora 
>> > > rawhide network
>> > > setup. Booting in single mode and modprobe ppp_generic is fine. 
>> > > The bug
>> > > appears when starting with my regular fedora network setup, which 
>> > > in my case
>> > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
>> > > 
>> > > Hope that helps.
>> > > 
>> > > I am attaching the config.
>> > 
>> > It does help a lot, thank you!!!
>> > 
>> > The following patch is a bit of a shot in the dark, and assumes that
>> > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
>> > idle
>> > code) introduced the problem.  Does this patch fix things up?
>> 
>> Unfortunately not, This is linus-tip + patch
>
>OK.  Can't have everything, I guess.
>
>> INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
>>   Not tainted 3.18.0-rc1+ #4
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> kworker/u16:6   D 8800ca84cec0 1116896  2 0x
>> Workqueue: netns cleanup_net
>>  8802218339e8 0096 8800ca84cec0 001d5f00
>>  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
>>  82c52040 7fff 81ee2658 81ee2650
>> Call Trace:
>>  [] schedule+0x29/0x70
>>  [] schedule_timeout+0x26c/0x410
>>  [] ? native_sched_clock+0x2a/0xa0
>>  [] ? mark_held_locks+0x7c/0xb0
>>  [] ? 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
> 
> On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> > > On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> > > > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
> > > > > > > wrote:
> > > > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > > > >  wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > > > Don't get me wrong -- the fact that this kthread 
> > > > > > > > > > appears to
> > > > > > > > > > have
> > > > > > > > > > blocked within rcu_barrier() for 120 seconds means 
> > > > > > > > > > that
> > > > > > > > > > something is
> > > > > > > > > > most definitely wrong here.  I am surprised that 
> > > > > > > > > > there are no
> > > > > > > > > > RCU CPU
> > > > > > > > > > stall warnings, but perhaps the blockage is in the 
> > > > > > > > > > callback
> > > > > > > > > > execution
> > > > > > > > > > rather than grace-period completion.  Or something is
> > > > > > > > > > preventing this
> > > > > > > > > > kthread from starting up after the wake-up callback 
> > > > > > > > > > executes.
> > > > > > > > > > Or...
> > > > > > > > > > 
> > > > > > > > > > Is this thing reproducible?
> > > > > > > > > 
> > > > > > > > > I've added Yanko on CC, who reported the backtrace 
> > > > > > > > > above and can
> > > > > > > > > recreate it reliably.  Apparently reverting the RCU 
> > > > > > > > > merge commit
> > > > > > > > > (d6dd50e) and rebuilding the latest after that does 
> > > > > > > > > not show the
> > > > > > > > > issue.  I'll let Yanko explain more and answer any 
> > > > > > > > > questions you
> > > > > > > > > have.
> > > > > > > > 
> > > > > > > > - It is reproducible
> > > > > > > > - I've done another build here to double check and it's 
> > > > > > > > definitely
> > > > > > > > the rcu merge
> > > > > > > >   that's causing it.
> > > > > > > > 
> > > > > > > > Don't think I'll be able to dig deeper, but I can do 
> > > > > > > > testing if
> > > > > > > > needed.
> > > > > > > 
> > > > > > > Please!  Does the following patch help?
> > > > > > 
> > > > > > Nope, doesn't seem to make a difference to the modprobe 
> > > > > > ppp_generic
> > > > > > test
> > > > > 
> > > > > Well, I was hoping.  I will take a closer look at the RCU 
> > > > > merge commit
> > > > > and see what suggests itself.  I am likely to ask you to 
> > > > > revert specific
> > > > > commits, if that works for you.
> > > > 
> > > > Well, rather than reverting commits, could you please try 
> > > > testing the
> > > > following commits?
> > > > 
> > > > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
> > > > callbacks after spawning)
> > > > 
> > > > 73a860cd58a1 (rcu: Replace flush_signals() with 
> > > > WARN_ON(signal_pending()))
> > > > 
> > > > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> > > > 
> > > > For whatever it is worth, I am guessing this one.
> > > 
> > > Indeed, c847f14217d5 it is.
> > > 
> > > Much to my embarrassment I just noticed that in addition to the
> > > rcu merge, triggering the bug "requires" my specific Fedora 
> > > rawhide network
> > > setup. Booting in single mode and modprobe ppp_generic is fine. 
> > > The bug
> > > appears when starting with my regular fedora network setup, which 
> > > in my case
> > > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> > > 
> > > Hope that helps.
> > > 
> > > I am attaching the config.
> > 
> > It does help a lot, thank you!!!
> > 
> > The following patch is a bit of a shot in the dark, and assumes that
> > commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> > idle
> > code) introduced the problem.  Does this patch fix things up?
> 
> Unfortunately not, This is linus-tip + patch

OK.  Can't have everything, I guess.

> INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
>   Not tainted 3.18.0-rc1+ #4
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u16:6   D 8800ca84cec0 1116896  2 0x
> Workqueue: netns cleanup_net
>  8802218339e8 0096 8800ca84cec0 001d5f00
>  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
>  82c52040 7fff 81ee2658 81ee2650
> Call Trace:
>  [] schedule+0x29/0x70
>  [] schedule_timeout+0x26c/0x410
>  [] ? native_sched_clock+0x2a/0xa0
>  [] ? mark_held_locks+0x7c/0xb0
>  [] ? _raw_spin_unlock_irq+0x30/0x50
>  [] ? trace_hardirqs_on_caller+0x15d/0x200
>  [] wait_for_completion+0x10c/0x150
>  [] ? wake_up_state+0x20/0x20
>  [] 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Yanko Kaneti

On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
> On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> > On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
> > > > > > wrote:
> > > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > > >  wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > > Don't get me wrong -- the fact that this kthread 
> > > > > > > > > appears to
> > > > > > > > > have
> > > > > > > > > blocked within rcu_barrier() for 120 seconds means 
> > > > > > > > > that
> > > > > > > > > something is
> > > > > > > > > most definitely wrong here.  I am surprised that 
> > > > > > > > > there are no
> > > > > > > > > RCU CPU
> > > > > > > > > stall warnings, but perhaps the blockage is in the 
> > > > > > > > > callback
> > > > > > > > > execution
> > > > > > > > > rather than grace-period completion.  Or something is
> > > > > > > > > preventing this
> > > > > > > > > kthread from starting up after the wake-up callback 
> > > > > > > > > executes.
> > > > > > > > > Or...
> > > > > > > > > 
> > > > > > > > > Is this thing reproducible?
> > > > > > > > 
> > > > > > > > I've added Yanko on CC, who reported the backtrace 
> > > > > > > > above and can
> > > > > > > > recreate it reliably.  Apparently reverting the RCU 
> > > > > > > > merge commit
> > > > > > > > (d6dd50e) and rebuilding the latest after that does 
> > > > > > > > not show the
> > > > > > > > issue.  I'll let Yanko explain more and answer any 
> > > > > > > > questions you
> > > > > > > > have.
> > > > > > > 
> > > > > > > - It is reproducible
> > > > > > > - I've done another build here to double check and it's 
> > > > > > > definitely
> > > > > > > the rcu merge
> > > > > > >   that's causing it.
> > > > > > > 
> > > > > > > Don't think I'll be able to dig deeper, but I can do 
> > > > > > > testing if
> > > > > > > needed.
> > > > > > 
> > > > > > Please!  Does the following patch help?
> > > > > 
> > > > > Nope, doesn't seem to make a difference to the modprobe 
> > > > > ppp_generic
> > > > > test
> > > > 
> > > > Well, I was hoping.  I will take a closer look at the RCU 
> > > > merge commit
> > > > and see what suggests itself.  I am likely to ask you to 
> > > > revert specific
> > > > commits, if that works for you.
> > > 
> > > Well, rather than reverting commits, could you please try 
> > > testing the
> > > following commits?
> > > 
> > > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
> > > callbacks after spawning)
> > > 
> > > 73a860cd58a1 (rcu: Replace flush_signals() with 
> > > WARN_ON(signal_pending()))
> > > 
> > > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> > > 
> > > For whatever it is worth, I am guessing this one.
> > 
> > Indeed, c847f14217d5 it is.
> > 
> > Much to my embarrassment I just noticed that in addition to the
> > rcu merge, triggering the bug "requires" my specific Fedora 
> > rawhide network
> > setup. Booting in single mode and modprobe ppp_generic is fine. 
> > The bug
> > appears when starting with my regular fedora network setup, which 
> > in my case
> > includes 3 ethernet adapters and a libvirt bridge+nat setup.
> > 
> > Hope that helps.
> > 
> > I am attaching the config.
> 
> It does help a lot, thank you!!!
> 
> The following patch is a bit of a shot in the dark, and assumes that
> commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
> idle
> code) introduced the problem.  Does this patch fix things up?

Unfortunately not. This is linus-tip + patch


INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:6   D 8800ca84cec0 1116896  2 0x
Workqueue: netns cleanup_net
 8802218339e8 0096 8800ca84cec0 001d5f00
 880221833fd8 001d5f00 880223264ec0 8800ca84cec0
 82c52040 7fff 81ee2658 81ee2650
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x26c/0x410
 [] ? native_sched_clock+0x2a/0xa0
 [] ? mark_held_locks+0x7c/0xb0
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] ? trace_hardirqs_on_caller+0x15d/0x200
 [] wait_for_completion+0x10c/0x150
 [] ? wake_up_state+0x20/0x20
 [] _rcu_barrier+0x159/0x200
 [] rcu_barrier+0x15/0x20
 [] netdev_run_todo+0x6f/0x310
 [] ? rollback_registered_many+0x265/0x2e0
 [] rtnl_unlock+0xe/0x10
 [] default_device_exit_batch+0x156/0x180
 [] ? abort_exclusive_wait+0xb0/0xb0
 [] ops_exit_list.isra.1+0x53/0x60
 [] cleanup_net+0x100/0x1f0
 [] process_one_work+0x218/0x850
 [] ? 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > >  wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > Don't get me wrong -- the fact that this kthread appears to 
> > > > > > > > have
> > > > > > > > blocked within rcu_barrier() for 120 seconds means that 
> > > > > > > > something is
> > > > > > > > most definitely wrong here.  I am surprised that there are no 
> > > > > > > > RCU CPU
> > > > > > > > stall warnings, but perhaps the blockage is in the callback 
> > > > > > > > execution
> > > > > > > > rather than grace-period completion.  Or something is 
> > > > > > > > preventing this
> > > > > > > > kthread from starting up after the wake-up callback executes.  
> > > > > > > > Or...
> > > > > > > > 
> > > > > > > > Is this thing reproducible?
> > > > > > > 
> > > > > > > I've added Yanko on CC, who reported the backtrace above and can
> > > > > > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > > > > > (d6dd50e) and rebuilding the latest after that does not show the
> > > > > > > issue.  I'll let Yanko explain more and answer any questions you 
> > > > > > > have.
> > > > > > 
> > > > > > - It is reproducible
> > > > > > - I've done another build here to double check and it's definitely 
> > > > > > the rcu merge
> > > > > >   that's causing it.
> > > > > > 
> > > > > > Don't think I'll be able to dig deeper, but I can do testing if 
> > > > > > needed.
> > > > > 
> > > > > Please!  Does the following patch help?
> > > > 
> > > > Nope, doesn't seem to make a difference to the modprobe ppp_generic 
> > > > test
> > > 
> > > Well, I was hoping.  I will take a closer look at the RCU merge commit
> > > and see what suggests itself.  I am likely to ask you to revert specific
> > > commits, if that works for you.
> > 
> > Well, rather than reverting commits, could you please try testing the
> > following commits?
> > 
> > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks 
> > after spawning)
> > 
> > 73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))
> > 
> > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> > 
> > For whatever it is worth, I am guessing this one.
> 
> Indeed, c847f14217d5 it is.
> 
> Much to my embarrassment I just noticed that in addition to the
> rcu merge, triggering the bug "requires" my specific Fedora rawhide network
> setup. Booting in single mode and modprobe ppp_generic is fine. The bug
> appears when starting with my regular fedora network setup, which in my case 
> includes 3 ethernet adapters and a libvirt bridge+nat setup.
> 
> Hope that helps. 
> 
> I am attaching the config.

It does help a lot, thank you!!!

The following patch is a bit of a shot in the dark, and assumes that
commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled idle
code) introduced the problem.  Does this patch fix things up?

Thanx, Paul



rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..4f3d25a58786 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
 	case CPU_DEAD_FROZEN:
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
-		for_each_rcu_flavor(rsp)
+		for_each_rcu_flavor(rsp) {
 			rcu_cleanup_dead_cpu(cpu, rsp);
+			do_nocb_deferred_wakeup(this_cpu_ptr(rsp->rda));
+		}
 		break;
 	default:
 		break;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 12:11:26PM -0400, Josh Boyer wrote:
> On Oct 23, 2014 11:37 AM, "Paul E. McKenney" 
> wrote:
> >
> > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > >  wrote:
> > > > >
> > > > > [ . . . ]
> > > > >
> > > > > > > > Don't get me wrong -- the fact that this kthread appears to
> > > > > > > > have
> > > > > > > > blocked within rcu_barrier() for 120 seconds means that
> > > > > > > > something is
> > > > > > > > most definitely wrong here.  I am surprised that there are no
> > > > > > > > RCU CPU
> > > > > > > > stall warnings, but perhaps the blockage is in the callback
> > > > > > > > execution
> > > > > > > > rather than grace-period completion.  Or something is
> > > > > > > > preventing this
> > > > > > > > kthread from starting up after the wake-up callback executes.
> > > > > > > > Or...
> > > > > > > >
> > > > > > > > Is this thing reproducible?
> > > > > > >
> > > > > > > I've added Yanko on CC, who reported the backtrace above and can
> > > > > > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > > > > > (d6dd50e) and rebuilding the latest after that does not show the
> > > > > > > issue.  I'll let Yanko explain more and answer any questions you
> > > > > > > have.
> > > > > >
> > > > > > - It is reproducible
> > > > > > - I've done another build here to double check and it's definitely
> > > > > > the rcu merge
> > > > > >   that's causing it.
> > > > > >
> > > > > > Don't think I'll be able to dig deeper, but I can do testing if
> > > > > > needed.
> > > > >
> > > > > Please!  Does the following patch help?
> > > >
> > > > Nope, doesn't seem to make a difference to the modprobe ppp_generic
> > > > test
> > >
> > > Well, I was hoping.  I will take a closer look at the RCU merge commit
> > > and see what suggests itself.  I am likely to ask you to revert specific
> > > commits, if that works for you.
> >
> > Well, rather than reverting commits, could you please try testing the
> > following commits?
> >
> > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks
> after spawning)
> >
> > 73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))
> >
> > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> >
> > For whatever it is worth, I am guessing this one.
> >
> > a53dd6a65668 (rcutorture: Add RCU-tasks tests to default rcutorture list)
> >
> > If any of the above fail, this one should also fail.
> >
> > Also, could you please send along your .config?
> 
> Which tree are those in?

They are all in Linus's tree.  They are topic branches of the RCU merge
commit (d6dd50e), and the test results will hopefully give me more of a
clue where to look.  As would the .config file.  ;-)

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > >  wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > > Don't get me wrong -- the fact that this kthread appears to 
> > > > > > have
> > > > > > blocked within rcu_barrier() for 120 seconds means that 
> > > > > > something is
> > > > > > most definitely wrong here.  I am surprised that there are no 
> > > > > > RCU CPU
> > > > > > stall warnings, but perhaps the blockage is in the callback 
> > > > > > execution
> > > > > > rather than grace-period completion.  Or something is 
> > > > > > preventing this
> > > > > > kthread from starting up after the wake-up callback executes.  
> > > > > > Or...
> > > > > > 
> > > > > > Is this thing reproducible?
> > > > > 
> > > > > I've added Yanko on CC, who reported the backtrace above and can
> > > > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > > > (d6dd50e) and rebuilding the latest after that does not show the
> > > > > issue.  I'll let Yanko explain more and answer any questions you 
> > > > > have.
> > > > 
> > > > - It is reproducible
> > > > - I've done another build here to double check and it's definitely 
> > > > the rcu merge
> > > >   that's causing it.
> > > > 
> > > > Don't think I'll be able to dig deeper, but I can do testing if 
> > > > needed.
> > > 
> > > Please!  Does the following patch help?
> > 
> > Nope, doesn't seem to make a difference to the modprobe ppp_generic 
> > test
> 
> Well, I was hoping.  I will take a closer look at the RCU merge commit
> and see what suggests itself.  I am likely to ask you to revert specific
> commits, if that works for you.

Well, rather than reverting commits, could you please try testing the
following commits?

11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks after 
spawning)

73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))

c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())

For whatever it is worth, I am guessing this one.

a53dd6a65668 (rcutorture: Add RCU-tasks tests to default rcutorture list)

If any of the above fail, this one should also fail.

Also, could you please send along your .config?

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > >  wrote:
> > 
> > [ . . . ]
> > 
> > > > > Don't get me wrong -- the fact that this kthread appears to 
> > > > > have
> > > > > blocked within rcu_barrier() for 120 seconds means that 
> > > > > something is
> > > > > most definitely wrong here.  I am surprised that there are no 
> > > > > RCU CPU
> > > > > stall warnings, but perhaps the blockage is in the callback 
> > > > > execution
> > > > > rather than grace-period completion.  Or something is 
> > > > > preventing this
> > > > > kthread from starting up after the wake-up callback executes.  
> > > > > Or...
> > > > > 
> > > > > Is this thing reproducible?
> > > > 
> > > > I've added Yanko on CC, who reported the backtrace above and can
> > > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > > (d6dd50e) and rebuilding the latest after that does not show the
> > > > issue.  I'll let Yanko explain more and answer any questions you 
> > > > have.
> > > 
> > > - It is reproducible
> > > - I've done another build here to double check and it's definitely 
> > > the rcu merge
> > >   that's causing it.
> > > 
> > > Don't think I'll be able to dig deeper, but I can do testing if 
> > > needed.
> > 
> > Please!  Does the following patch help?
> 
> Nope, doesn't seem to make a difference to the modprobe ppp_generic 
> test

Well, I was hoping.  I will take a closer look at the RCU merge commit
and see what suggests itself.  I am likely to ask you to revert specific
commits, if that works for you.

Thanx, Paul

> INFO: task kworker/u16:6:101 blocked for more than 120 seconds.
>   Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> message.
> kworker/u16:6   D 88022067cec0 11680   101  2 0x
> Workqueue: netns cleanup_net
>  8802206939e8 0096 88022067cec0 001d5f00
>  880220693fd8 001d5f00 880223263480 88022067cec0
>  82c51d60 7fff 81ee2698 81ee2690
> Call Trace:
>  [] schedule+0x29/0x70
>  [] schedule_timeout+0x26c/0x410
>  [] ? native_sched_clock+0x2a/0xa0
>  [] ? mark_held_locks+0x7c/0xb0
>  [] ? _raw_spin_unlock_irq+0x30/0x50
>  [] ? trace_hardirqs_on_caller+0x15d/0x200
>  [] wait_for_completion+0x10c/0x150
>  [] ? wake_up_state+0x20/0x20
>  [] _rcu_barrier+0x159/0x200
>  [] rcu_barrier+0x15/0x20
>  [] netdev_run_todo+0x6f/0x310
>  [] ? rollback_registered_many+0x265/0x2e0
>  [] rtnl_unlock+0xe/0x10
>  [] default_device_exit_batch+0x156/0x180
>  [] ? abort_exclusive_wait+0xb0/0xb0
>  [] ops_exit_list.isra.1+0x53/0x60
>  [] cleanup_net+0x100/0x1f0
>  [] process_one_work+0x218/0x850
>  [] ? process_one_work+0x17f/0x850
>  [] ? worker_thread+0xe7/0x4a0
>  [] worker_thread+0x6b/0x4a0
>  [] ? process_one_work+0x850/0x850
>  [] kthread+0x10b/0x130
>  [] ? sched_clock+0x9/0x10
>  [] ? kthread_create_on_node+0x250/0x250
>  [] ret_from_fork+0x7c/0xb0
>  [] ? kthread_create_on_node+0x250/0x250
> 4 locks held by kworker/u16:6/101:
>  #0:  ("%s""netns"){.+.+.+}, at: [] 
> process_one_work+0x17f/0x850
>  #1:  (net_cleanup_work){+.+.+.}, at: [] 
> process_one_work+0x17f/0x850
>  #2:  (net_mutex){+.+.+.}, at: [] cleanup_net+0x8c/0x1f0
>  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [] 
> _rcu_barrier+0x35/0x200
> INFO: task modprobe:1139 blocked for more than 120 seconds.
>   Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> message.
> modprobe D 880213ac1a40 13112  1139   1138 0x0080
>  880036ab3be8 0096 880213ac1a40 001d5f00
>  880036ab3fd8 001d5f00 880223264ec0 880213ac1a40
>  880213ac1a40 81f8fb48 0246 880213ac1a40
> Call Trace:
>  [] schedule_preempt_disabled+0x31/0x80
>  [] mutex_lock_nested+0x183/0x440
>  [] ? register_pernet_subsys+0x1f/0x50
>  [] ? register_pernet_subsys+0x1f/0x50
>  [] ? 0xa06f3000
>  [] register_pernet_subsys+0x1f/0x50
>  [] br_init+0x48/0xd3 [bridge]
>  [] do_one_initcall+0xd8/0x210
>  [] load_module+0x20c2/0x2870
>  [] ? store_uevent+0x70/0x70
>  [] ? lock_release_non_nested+0x3c6/0x3d0
>  [] SyS_init_module+0xe7/0x140
>  [] system_call_fastpath+0x12/0x17
> 1 lock held by modprobe/1139:
>  #0:  (net_mutex){+.+.+.}, at: [] 
> register_pernet_subsys+0x1f/0x50
> INFO: task modprobe:1209 blocked for more than 120 seconds.
>   Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> message.
> modprobe D 8800c5324ec0 13368  1209   1151 0x0080
>  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Yanko Kaneti
On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > >  wrote:
> 
> [ . . . ]
> 
> > > > Don't get me wrong -- the fact that this kthread appears to 
> > > > have
> > > > blocked within rcu_barrier() for 120 seconds means that 
> > > > something is
> > > > most definitely wrong here.  I am surprised that there are no 
> > > > RCU CPU
> > > > stall warnings, but perhaps the blockage is in the callback 
> > > > execution
> > > > rather than grace-period completion.  Or something is 
> > > > preventing this
> > > > kthread from starting up after the wake-up callback executes.  
> > > > Or...
> > > > 
> > > > Is this thing reproducible?
> > > 
> > > I've added Yanko on CC, who reported the backtrace above and can
> > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > (d6dd50e) and rebuilding the latest after that does not show the
> > > issue.  I'll let Yanko explain more and answer any questions you 
> > > have.
> > 
> > - It is reproducible
> > - I've done another build here to double check and it's definitely 
> > the rcu merge
> >   that's causing it.
> > 
> > Don't think I'll be able to dig deeper, but I can do testing if 
> > needed.
> 
> Please!  Does the following patch help?

Nope, doesn't seem to make a difference to the modprobe ppp_generic 
test


INFO: task kworker/u16:6:101 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
kworker/u16:6   D 88022067cec0 11680   101  2 0x
Workqueue: netns cleanup_net
 8802206939e8 0096 88022067cec0 001d5f00
 880220693fd8 001d5f00 880223263480 88022067cec0
 82c51d60 7fff 81ee2698 81ee2690
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x26c/0x410
 [] ? native_sched_clock+0x2a/0xa0
 [] ? mark_held_locks+0x7c/0xb0
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] ? trace_hardirqs_on_caller+0x15d/0x200
 [] wait_for_completion+0x10c/0x150
 [] ? wake_up_state+0x20/0x20
 [] _rcu_barrier+0x159/0x200
 [] rcu_barrier+0x15/0x20
 [] netdev_run_todo+0x6f/0x310
 [] ? rollback_registered_many+0x265/0x2e0
 [] rtnl_unlock+0xe/0x10
 [] default_device_exit_batch+0x156/0x180
 [] ? abort_exclusive_wait+0xb0/0xb0
 [] ops_exit_list.isra.1+0x53/0x60
 [] cleanup_net+0x100/0x1f0
 [] process_one_work+0x218/0x850
 [] ? process_one_work+0x17f/0x850
 [] ? worker_thread+0xe7/0x4a0
 [] worker_thread+0x6b/0x4a0
 [] ? process_one_work+0x850/0x850
 [] kthread+0x10b/0x130
 [] ? sched_clock+0x9/0x10
 [] ? kthread_create_on_node+0x250/0x250
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/101:
 #0:  ("%s""netns"){.+.+.+}, at: [] 
process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: [] 
process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [] cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [] 
_rcu_barrier+0x35/0x200
INFO: task modprobe:1139 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
modprobe D 880213ac1a40 13112  1139   1138 0x0080
 880036ab3be8 0096 880213ac1a40 001d5f00
 880036ab3fd8 001d5f00 880223264ec0 880213ac1a40
 880213ac1a40 81f8fb48 0246 880213ac1a40
Call Trace:
 [] schedule_preempt_disabled+0x31/0x80
 [] mutex_lock_nested+0x183/0x440
 [] ? register_pernet_subsys+0x1f/0x50
 [] ? register_pernet_subsys+0x1f/0x50
 [] ? 0xa06f3000
 [] register_pernet_subsys+0x1f/0x50
 [] br_init+0x48/0xd3 [bridge]
 [] do_one_initcall+0xd8/0x210
 [] load_module+0x20c2/0x2870
 [] ? store_uevent+0x70/0x70
 [] ? lock_release_non_nested+0x3c6/0x3d0
 [] SyS_init_module+0xe7/0x140
 [] system_call_fastpath+0x12/0x17
1 lock held by modprobe/1139:
 #0:  (net_mutex){+.+.+.}, at: [] 
register_pernet_subsys+0x1f/0x50
INFO: task modprobe:1209 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
modprobe D 8800c5324ec0 13368  1209   1151 0x0080
 88020d14bbe8 0096 8800c5324ec0 001d5f00
 88020d14bfd8 001d5f00 88022328 8800c5324ec0
 8800c5324ec0 81f8fb48 0246 8800c5324ec0
Call Trace:
 [] schedule_preempt_disabled+0x31/0x80
 [] mutex_lock_nested+0x183/0x440
 [] ? register_pernet_device+0x1d/0x70
 [] ? register_pernet_device+0x1d/0x70
 [] ? 0xa070f000
 [] register_pernet_device+0x1d/0x70
 [] ppp_init+0x20/0x1000 [ppp_generic]
 [] do_one_initcall+0xd8/0x210
 [] load_module+0x20c2/0x2870
 [] ? 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Yanko Kaneti
On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
  On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
   On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
 
 [ . . . ]
 
Don't get me wrong -- the fact that this kthread appears to 
have
blocked within rcu_barrier() for 120 seconds means that 
something is
most definitely wrong here.  I am surprised that there are no 
RCU CPU
stall warnings, but perhaps the blockage is in the callback 
execution
rather than grace-period completion.  Or something is 
preventing this
kthread from starting up after the wake-up callback executes.  
Or...

Is this thing reproducible?
   
   I've added Yanko on CC, who reported the backtrace above and can
   recreate it reliably.  Apparently reverting the RCU merge commit
   (d6dd50e) and rebuilding the latest after that does not show the
   issue.  I'll let Yanko explain more and answer any questions you 
   have.
  
  - It is reproducible
  - I've done another build here to double check and its definitely 
  the rcu merge
that's causing it.
  
  Don't think I'll be able to dig deeper, but I can do testing if 
  needed.
 
 Please!  Does the following patch help?

Nope, doesn't seem to make a difference to the modprobe ppp_generic 
test


INFO: task kworker/u16:6:101 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
kworker/u16:6   D 88022067cec0 11680   101  2 0x
Workqueue: netns cleanup_net
 8802206939e8 0096 88022067cec0 001d5f00
 880220693fd8 001d5f00 880223263480 88022067cec0
 82c51d60 7fff 81ee2698 81ee2690
Call Trace:
 [8185e289] schedule+0x29/0x70
 [818634ac] schedule_timeout+0x26c/0x410
 [81028c4a] ? native_sched_clock+0x2a/0xa0
 [81107afc] ? mark_held_locks+0x7c/0xb0
 [81864530] ? _raw_spin_unlock_irq+0x30/0x50
 [81107c8d] ? trace_hardirqs_on_caller+0x15d/0x200
 [8185fcbc] wait_for_completion+0x10c/0x150
 [810e5430] ? wake_up_state+0x20/0x20
 [8112a799] _rcu_barrier+0x159/0x200
 [8112a895] rcu_barrier+0x15/0x20
 [81718f0f] netdev_run_todo+0x6f/0x310
 [8170dad5] ? rollback_registered_many+0x265/0x2e0
 [81725f7e] rtnl_unlock+0xe/0x10
 [8170f936] default_device_exit_batch+0x156/0x180
 [810fd8f0] ? abort_exclusive_wait+0xb0/0xb0
 [817079e3] ops_exit_list.isra.1+0x53/0x60
 [81708590] cleanup_net+0x100/0x1f0
 [810ccff8] process_one_work+0x218/0x850
 [810ccf5f] ? process_one_work+0x17f/0x850
 [810cd717] ? worker_thread+0xe7/0x4a0
 [810cd69b] worker_thread+0x6b/0x4a0
 [810cd630] ? process_one_work+0x850/0x850
 [810d39eb] kthread+0x10b/0x130
 [81028cc9] ? sched_clock+0x9/0x10
 [810d38e0] ? kthread_create_on_node+0x250/0x250
 [8186527c] ret_from_fork+0x7c/0xb0
 [810d38e0] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/101:
 #0:  (%snetns){.+.+.+}, at: [810ccf5f] 
process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: [810ccf5f] 
process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [8170851c] cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [8112a675] 
_rcu_barrier+0x35/0x200
INFO: task modprobe:1139 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
modprobeD 880213ac1a40 13112  1139   1138 0x0080
 880036ab3be8 0096 880213ac1a40 001d5f00
 880036ab3fd8 001d5f00 880223264ec0 880213ac1a40
 880213ac1a40 81f8fb48 0246 880213ac1a40
Call Trace:
 [8185e831] schedule_preempt_disabled+0x31/0x80
 [81860083] mutex_lock_nested+0x183/0x440
 [817083af] ? register_pernet_subsys+0x1f/0x50
 [817083af] ? register_pernet_subsys+0x1f/0x50
 [a06f3000] ? 0xa06f3000
 [817083af] register_pernet_subsys+0x1f/0x50
 [a06f3048] br_init+0x48/0xd3 [bridge]
 [81002148] do_one_initcall+0xd8/0x210
 [81153c52] load_module+0x20c2/0x2870
 [8114ec30] ? store_uevent+0x70/0x70
 [8110ac76] ? lock_release_non_nested+0x3c6/0x3d0
 [811544e7] SyS_init_module+0xe7/0x140
 [81865329] system_call_fastpath+0x12/0x17
1 lock held by modprobe/1139:
 #0:  (net_mutex){+.+.+.}, at: [817083af] 
register_pernet_subsys+0x1f/0x50
INFO: task modprobe:1209 blocked for more than 120 seconds.
  Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
 On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
   On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
  
  [ . . . ]
  
 Don't get me wrong -- the fact that this kthread appears to 
 have
 blocked within rcu_barrier() for 120 seconds means that 
 something is
 most definitely wrong here.  I am surprised that there are no 
 RCU CPU
 stall warnings, but perhaps the blockage is in the callback 
 execution
 rather than grace-period completion.  Or something is 
 preventing this
 kthread from starting up after the wake-up callback executes.  
 Or...
 
 Is this thing reproducible?

I've added Yanko on CC, who reported the backtrace above and can
recreate it reliably.  Apparently reverting the RCU merge commit
(d6dd50e) and rebuilding the latest after that does not show the
issue.  I'll let Yanko explain more and answer any questions you 
have.
   
   - It is reproducible
   - I've done another build here to double check and its definitely 
   the rcu merge
 that's causing it.
   
   Don't think I'll be able to dig deeper, but I can do testing if 
   needed.
  
  Please!  Does the following patch help?
 
 Nope, doesn't seem to make a difference to the modprobe ppp_generic 
 test

Well, I was hoping.  I will take a closer look at the RCU merge commit
and see what suggests itself.  I am likely to ask you to revert specific
commits, if that works for you.

Thanx, Paul

 INFO: task kworker/u16:6:101 blocked for more than 120 seconds.
   Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
 message.
 kworker/u16:6   D 88022067cec0 11680   101  2 0x
 Workqueue: netns cleanup_net
  8802206939e8 0096 88022067cec0 001d5f00
  880220693fd8 001d5f00 880223263480 88022067cec0
  82c51d60 7fff 81ee2698 81ee2690
 Call Trace:
  [8185e289] schedule+0x29/0x70
  [818634ac] schedule_timeout+0x26c/0x410
  [81028c4a] ? native_sched_clock+0x2a/0xa0
  [81107afc] ? mark_held_locks+0x7c/0xb0
  [81864530] ? _raw_spin_unlock_irq+0x30/0x50
  [81107c8d] ? trace_hardirqs_on_caller+0x15d/0x200
  [8185fcbc] wait_for_completion+0x10c/0x150
  [810e5430] ? wake_up_state+0x20/0x20
  [8112a799] _rcu_barrier+0x159/0x200
  [8112a895] rcu_barrier+0x15/0x20
  [81718f0f] netdev_run_todo+0x6f/0x310
  [8170dad5] ? rollback_registered_many+0x265/0x2e0
  [81725f7e] rtnl_unlock+0xe/0x10
  [8170f936] default_device_exit_batch+0x156/0x180
  [810fd8f0] ? abort_exclusive_wait+0xb0/0xb0
  [817079e3] ops_exit_list.isra.1+0x53/0x60
  [81708590] cleanup_net+0x100/0x1f0
  [810ccff8] process_one_work+0x218/0x850
  [810ccf5f] ? process_one_work+0x17f/0x850
  [810cd717] ? worker_thread+0xe7/0x4a0
  [810cd69b] worker_thread+0x6b/0x4a0
  [810cd630] ? process_one_work+0x850/0x850
  [810d39eb] kthread+0x10b/0x130
  [81028cc9] ? sched_clock+0x9/0x10
  [810d38e0] ? kthread_create_on_node+0x250/0x250
  [8186527c] ret_from_fork+0x7c/0xb0
  [810d38e0] ? kthread_create_on_node+0x250/0x250
 4 locks held by kworker/u16:6/101:
  #0:  (%snetns){.+.+.+}, at: [810ccf5f] 
 process_one_work+0x17f/0x850
  #1:  (net_cleanup_work){+.+.+.}, at: [810ccf5f] 
 process_one_work+0x17f/0x850
  #2:  (net_mutex){+.+.+.}, at: [8170851c] cleanup_net+0x8c/0x1f0
  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [8112a675] 
 _rcu_barrier+0x35/0x200
 INFO: task modprobe:1139 blocked for more than 120 seconds.
   Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
 message.
 modprobeD 880213ac1a40 13112  1139   1138 0x0080
  880036ab3be8 0096 880213ac1a40 001d5f00
  880036ab3fd8 001d5f00 880223264ec0 880213ac1a40
  880213ac1a40 81f8fb48 0246 880213ac1a40
 Call Trace:
  [8185e831] schedule_preempt_disabled+0x31/0x80
  [81860083] mutex_lock_nested+0x183/0x440
  [817083af] ? register_pernet_subsys+0x1f/0x50
  [817083af] ? register_pernet_subsys+0x1f/0x50
  [a06f3000] ? 0xa06f3000
  [817083af] register_pernet_subsys+0x1f/0x50
  [a06f3048] br_init+0x48/0xd3 [bridge]
  [81002148] do_one_initcall+0xd8/0x210
  [81153c52] load_module+0x20c2/0x2870
  [8114ec30] ? store_uevent+0x70/0x70
 [8110ac76] ? lock_release_non_nested+0x3c6/0x3d0

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
  On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
 On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
   
   [ . . . ]
   
  Don't get me wrong -- the fact that this kthread appears to 
  have
  blocked within rcu_barrier() for 120 seconds means that 
  something is
  most definitely wrong here.  I am surprised that there are no 
  RCU CPU
  stall warnings, but perhaps the blockage is in the callback 
  execution
  rather than grace-period completion.  Or something is 
  preventing this
  kthread from starting up after the wake-up callback executes.  
  Or...
  
  Is this thing reproducible?
 
 I've added Yanko on CC, who reported the backtrace above and can
 recreate it reliably.  Apparently reverting the RCU merge commit
 (d6dd50e) and rebuilding the latest after that does not show the
 issue.  I'll let Yanko explain more and answer any questions you 
 have.

- It is reproducible
- I've done another build here to double check and it's definitely the rcu merge
  that's causing it.

Don't think I'll be able to dig deeper, but I can do testing if 
needed.
   
   Please!  Does the following patch help?
  
  Nope, doesn't seem to make a difference to the modprobe ppp_generic 
  test
 
 Well, I was hoping.  I will take a closer look at the RCU merge commit
 and see what suggests itself.  I am likely to ask you to revert specific
 commits, if that works for you.

Well, rather than reverting commits, could you please try testing the
following commits?

11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks after 
spawning)

73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))

c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())

For whatever it is worth, I am guessing this one.

a53dd6a65668 (rcutorture: Add RCU-tasks tests to default rcutorture list)

If any of the above fail, this one should also fail.

Also, could you please send along your .config?

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 12:11:26PM -0400, Josh Boyer wrote:
 On Oct 23, 2014 11:37 AM, Paul E. McKenney paul...@linux.vnet.ibm.com
 wrote:
 
  On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
  On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
   On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:

 [ . . . ]

Don't get me wrong -- the fact that this kthread appears to
have
blocked within rcu_barrier() for 120 seconds means that
something is
most definitely wrong here.  I am surprised that there are no
RCU CPU
stall warnings, but perhaps the blockage is in the callback
execution
rather than grace-period completion.  Or something is
preventing this
kthread from starting up after the wake-up callback executes.
Or...
   
Is this thing reproducible?
  
   I've added Yanko on CC, who reported the backtrace above and can
   recreate it reliably.  Apparently reverting the RCU merge commit
   (d6dd50e) and rebuilding the latest after that does not show the
   issue.  I'll let Yanko explain more and answer any questions you
   have.
 
  - It is reproducible
  - I've done another build here to double check and its definitely
  the rcu merge
that's causing it.
 
  Don't think I'll be able to dig deeper, but I can do testing if
  needed.

 Please!  Does the following patch help?
   
Nope, doesn't seem to make a difference to the modprobe ppp_generic
test
  
   Well, I was hoping.  I will take a closer look at the RCU merge commit
   and see what suggests itself.  I am likely to ask you to revert specific
   commits, if that works for you.
 
  Well, rather than reverting commits, could you please try testing the
  following commits?
 
  11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks
 after spawning)
 
  73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))
 
  c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
 
  For whatever it is worth, I am guessing this one.
 
  a53dd6a65668 (rcutorture: Add RCU-tasks tests to default rcutorture list)
 
  If any of the above fail, this one should also fail.
 
  Also, could you please send along your .config?
 
 Which tree are those in?

They are all in Linus's tree.  They are topic branches of the RCU merge
commit (d6dd50e), and the test results will hopefully give me more of a
clue where to look.  As would the .config file.  ;-)

Thanx, Paul



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
 On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
  On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
   On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
 
 [ . . . ]
 
Don't get me wrong -- the fact that this kthread appears to 
have
blocked within rcu_barrier() for 120 seconds means that 
something is
most definitely wrong here.  I am surprised that there are no 
RCU CPU
stall warnings, but perhaps the blockage is in the callback 
execution
rather than grace-period completion.  Or something is 
preventing this
kthread from starting up after the wake-up callback executes.  
Or...

Is this thing reproducible?
   
   I've added Yanko on CC, who reported the backtrace above and can
   recreate it reliably.  Apparently reverting the RCU merge commit
   (d6dd50e) and rebuilding the latest after that does not show the
   issue.  I'll let Yanko explain more and answer any questions you 
   have.
  
  - It is reproducible
  - I've done another build here to double check and its definitely 
  the rcu merge
that's causing it.
  
  Don't think I'll be able to dig deeper, but I can do testing if 
  needed.
 
 Please!  Does the following patch help?

Nope, doesn't seem to make a difference to the modprobe ppp_generic 
test
   
   Well, I was hoping.  I will take a closer look at the RCU merge commit
   and see what suggests itself.  I am likely to ask you to revert specific
   commits, if that works for you.
  
  Well, rather than reverting commits, could you please try testing the
  following commits?
  
  11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks 
  after spawning)
  
  73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))
  
  c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
  
  For whatever it is worth, I am guessing this one.
 
 Indeed, c847f14217d5 it is.
 
 Much to my embarrassment I just noticed that in addition to the
 rcu merge, triggering the bug requires my specific Fedora rawhide network
 setup. Booting in single mode and modprobe ppp_generic is fine. The bug
 appears when starting with my regular fedora network setup, which in my case 
 includes 3 ethernet adapters and a libvirt bridge+nat setup.
 
 Hope that helps. 
 
 I am attaching the config.

It does help a lot, thank you!!!

The following patch is a bit of a shot in the dark, and assumes that
commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled idle
code) introduced the problem.  Does this patch fix things up?

Thanx, Paul



rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..4f3d25a58786 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
case CPU_DEAD_FROZEN:
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
-   for_each_rcu_flavor(rsp)
+   for_each_rcu_flavor(rsp) {
rcu_cleanup_dead_cpu(cpu, rsp);
+   do_nocb_deferred_wakeup(this_cpu_ptr(rsp->rda));
+   }
break;
default:
break;



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Yanko Kaneti

On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
  On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
 On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
  wrote:
   On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
  
  [ . . . ]
  
 Don't get me wrong -- the fact that this kthread 
 appears to
 have
 blocked within rcu_barrier() for 120 seconds means 
 that
 something is
 most definitely wrong here.  I am surprised that 
 there are no
 RCU CPU
 stall warnings, but perhaps the blockage is in the 
 callback
 execution
 rather than grace-period completion.  Or something is
 preventing this
 kthread from starting up after the wake-up callback 
 executes.
 Or...
 
 Is this thing reproducible?

I've added Yanko on CC, who reported the backtrace 
above and can
recreate it reliably.  Apparently reverting the RCU 
merge commit
(d6dd50e) and rebuilding the latest after that does 
not show the
issue.  I'll let Yanko explain more and answer any 
questions you
have.
   
   - It is reproducible
   - I've done another build here to double check and its 
   definitely
   the rcu merge
 that's causing it.
   
   Don't think I'll be able to dig deeper, but I can do 
   testing if
   needed.
  
  Please!  Does the following patch help?
 
 Nope, doesn't seem to make a difference to the modprobe 
 ppp_generic
 test

Well, I was hoping.  I will take a closer look at the RCU 
merge commit
and see what suggests itself.  I am likely to ask you to 
revert specific
commits, if that works for you.
   
   Well, rather than reverting commits, could you please try 
   testing the
   following commits?
   
   11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
   callbacks after spawning)
   
   73a860cd58a1 (rcu: Replace flush_signals() with 
   WARN_ON(signal_pending()))
   
   c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
   
   For whatever it is worth, I am guessing this one.
  
  Indeed, c847f14217d5 it is.
  
   Much to my embarrassment I just noticed that in addition to the
  rcu merge, triggering the bug requires my specific Fedora 
  rawhide network
  setup. Booting in single mode and modprobe ppp_generic is fine. 
  The bug
  appears when starting with my regular fedora network setup, which 
  in my case
   includes 3 ethernet adapters and a libvirt bridge+nat setup.
  
  Hope that helps.
  
  I am attaching the config.
 
 It does help a lot, thank you!!!
 
 The following patch is a bit of a shot in the dark, and assumes that
 commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
 idle
 code) introduced the problem.  Does this patch fix things up?

Unfortunately not, This is linus-tip + patch


INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
  Not tainted 3.18.0-rc1+ #4
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kworker/u16:6   D 8800ca84cec0 11168    96      2 0x
Workqueue: netns cleanup_net
 8802218339e8 0096 8800ca84cec0 001d5f00
 880221833fd8 001d5f00 880223264ec0 8800ca84cec0
 82c52040 7fff 81ee2658 81ee2650
Call Trace:
 [8185b8e9] schedule+0x29/0x70
 [81860b0c] schedule_timeout+0x26c/0x410
 [81028bea] ? native_sched_clock+0x2a/0xa0
 [8110759c] ? mark_held_locks+0x7c/0xb0
 [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
 [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
 [8185d31c] wait_for_completion+0x10c/0x150
 [810e4ed0] ? wake_up_state+0x20/0x20
 [8112a219] _rcu_barrier+0x159/0x200
 [8112a315] rcu_barrier+0x15/0x20
 [8171657f] netdev_run_todo+0x6f/0x310
 [8170b145] ? rollback_registered_many+0x265/0x2e0
 [817235ee] rtnl_unlock+0xe/0x10
 [8170cfa6] default_device_exit_batch+0x156/0x180
 [810fd390] ? abort_exclusive_wait+0xb0/0xb0
 [81705053] ops_exit_list.isra.1+0x53/0x60
 [81705c00] cleanup_net+0x100/0x1f0
 [810cca98] process_one_work+0x218/0x850
 [810cc9ff] ? process_one_work+0x17f/0x850
 [810cd1b7] ? worker_thread+0xe7/0x4a0
 [810cd13b] worker_thread+0x6b/0x4a0
 [810cd0d0] ? process_one_work+0x850/0x850
 [810d348b] kthread+0x10b/0x130
 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Paul E. McKenney
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
  On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
   wrote:
On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
 On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
   
   [ . . . ]
   
  Don't get me wrong -- the fact that this kthread 
  appears to
  have
  blocked within rcu_barrier() for 120 seconds means 
  that
  something is
  most definitely wrong here.  I am surprised that 
  there are no
  RCU CPU
  stall warnings, but perhaps the blockage is in the 
  callback
  execution
  rather than grace-period completion.  Or something is
  preventing this
  kthread from starting up after the wake-up callback 
  executes.
  Or...
  
  Is this thing reproducible?
 
 I've added Yanko on CC, who reported the backtrace 
 above and can
 recreate it reliably.  Apparently reverting the RCU 
 merge commit
 (d6dd50e) and rebuilding the latest after that does 
 not show the
 issue.  I'll let Yanko explain more and answer any 
 questions you
 have.

- It is reproducible
- I've done another build here to double check and its 
definitely
the rcu merge
  that's causing it.

Don't think I'll be able to dig deeper, but I can do 
testing if
needed.
   
   Please!  Does the following patch help?
  
  Nope, doesn't seem to make a difference to the modprobe 
  ppp_generic
  test
 
 Well, I was hoping.  I will take a closer look at the RCU 
 merge commit
 and see what suggests itself.  I am likely to ask you to 
 revert specific
 commits, if that works for you.

Well, rather than reverting commits, could you please try 
testing the
following commits?

11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
callbacks after spawning)

73a860cd58a1 (rcu: Replace flush_signals() with 
WARN_ON(signal_pending()))

c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())

For whatever it is worth, I am guessing this one.
   
   Indeed, c847f14217d5 it is.
   
   Much to my embarrassment I just noticed that in addition to the
   rcu merge, triggering the bug requires my specific Fedora 
   rawhide network
   setup. Booting in single mode and modprobe ppp_generic is fine. 
   The bug
   appears when starting with my regular fedora network setup, which 
   in my case
   includes 3 ethernet adapters and a libvirt bridge+nat setup.
   
   Hope that helps.
   
   I am attaching the config.
  
  It does help a lot, thank you!!!
  
  The following patch is a bit of a shot in the dark, and assumes that
  commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
  idle
  code) introduced the problem.  Does this patch fix things up?
 
 Unfortunately not, This is linus-tip + patch

OK.  Can't have everything, I guess.

 INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
   Not tainted 3.18.0-rc1+ #4
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 kworker/u16:6   D 8800ca84cec0 11168    96      2 0x
 Workqueue: netns cleanup_net
  8802218339e8 0096 8800ca84cec0 001d5f00
  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
  82c52040 7fff 81ee2658 81ee2650
 Call Trace:
  [8185b8e9] schedule+0x29/0x70
  [81860b0c] schedule_timeout+0x26c/0x410
  [81028bea] ? native_sched_clock+0x2a/0xa0
  [8110759c] ? mark_held_locks+0x7c/0xb0
  [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
  [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
  [8185d31c] wait_for_completion+0x10c/0x150
  [810e4ed0] ? wake_up_state+0x20/0x20
  [8112a219] _rcu_barrier+0x159/0x200
  [8112a315] rcu_barrier+0x15/0x20
  [8171657f] netdev_run_todo+0x6f/0x310
  [8170b145] ? rollback_registered_many+0x265/0x2e0
  [817235ee] rtnl_unlock+0xe/0x10
  [8170cfa6] default_device_exit_batch+0x156/0x180
  [810fd390] ? abort_exclusive_wait+0xb0/0xb0
  [81705053] ops_exit_list.isra.1+0x53/0x60
  [81705c00] cleanup_net+0x100/0x1f0
  [810cca98] process_one_work+0x218/0x850
  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Jay Vosburgh
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
 On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
  On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
   On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti 
   wrote:
On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
 On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
   
   [ . . . ]
   
  Don't get me wrong -- the fact that this kthread 
  appears to
  have
  blocked within rcu_barrier() for 120 seconds means 
  that
  something is
  most definitely wrong here.  I am surprised that 
  there are no
  RCU CPU
  stall warnings, but perhaps the blockage is in the 
  callback
  execution
  rather than grace-period completion.  Or something is
  preventing this
  kthread from starting up after the wake-up callback 
  executes.
  Or...
  
  Is this thing reproducible?
 
 I've added Yanko on CC, who reported the backtrace 
 above and can
 recreate it reliably.  Apparently reverting the RCU 
 merge commit
 (d6dd50e) and rebuilding the latest after that does 
 not show the
 issue.  I'll let Yanko explain more and answer any 
 questions you
 have.

- It is reproducible
- I've done another build here to double check and its 
definitely
the rcu merge
  that's causing it.

Don't think I'll be able to dig deeper, but I can do 
testing if
needed.
   
   Please!  Does the following patch help?
  
  Nope, doesn't seem to make a difference to the modprobe 
  ppp_generic
  test
 
 Well, I was hoping.  I will take a closer look at the RCU 
 merge commit
 and see what suggests itself.  I am likely to ask you to 
 revert specific
 commits, if that works for you.

Well, rather than reverting commits, could you please try 
testing the
following commits?

11ed7f934cb8 (rcu: Make nocb leader kthreads process pending 
callbacks after spawning)

73a860cd58a1 (rcu: Replace flush_signals() with 
WARN_ON(signal_pending()))

c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())

For whatever it is worth, I am guessing this one.
   
   Indeed, c847f14217d5 it is.
   
    Much to my embarrassment I just noticed that in addition to the
   rcu merge, triggering the bug requires my specific Fedora 
   rawhide network
   setup. Booting in single mode and modprobe ppp_generic is fine. 
   The bug
   appears when starting with my regular fedora network setup, which 
   in my case
    includes 3 ethernet adapters and a libvirt bridge+nat setup.
   
   Hope that helps.
   
   I am attaching the config.
  
  It does help a lot, thank you!!!
  
  The following patch is a bit of a shot in the dark, and assumes that
  commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled 
  idle
  code) introduced the problem.  Does this patch fix things up?
 
 Unfortunately not, This is linus-tip + patch

OK.  Can't have everything, I guess.

 INFO: task kworker/u16:6:96 blocked for more than 120 seconds.
   Not tainted 3.18.0-rc1+ #4
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 kworker/u16:6   D 8800ca84cec0 11168    96      2 0x
 Workqueue: netns cleanup_net
  8802218339e8 0096 8800ca84cec0 001d5f00
  880221833fd8 001d5f00 880223264ec0 8800ca84cec0
  82c52040 7fff 81ee2658 81ee2650
 Call Trace:
  [8185b8e9] schedule+0x29/0x70
  [81860b0c] schedule_timeout+0x26c/0x410
  [81028bea] ? native_sched_clock+0x2a/0xa0
  [8110759c] ? mark_held_locks+0x7c/0xb0
  [81861b90] ? _raw_spin_unlock_irq+0x30/0x50
  [8110772d] ? trace_hardirqs_on_caller+0x15d/0x200
  [8185d31c] wait_for_completion+0x10c/0x150
  [810e4ed0] ? wake_up_state+0x20/0x20
  [8112a219] _rcu_barrier+0x159/0x200
  [8112a315] rcu_barrier+0x15/0x20
  [8171657f] netdev_run_todo+0x6f/0x310
  [8170b145] ? rollback_registered_many+0x265/0x2e0
  [817235ee] rtnl_unlock+0xe/0x10
  [8170cfa6] default_device_exit_batch+0x156/0x180
  [810fd390] ? abort_exclusive_wait+0xb0/0xb0
  [81705053] ops_exit_list.isra.1+0x53/0x60
  [81705c00] cleanup_net+0x100/0x1f0
  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Paul E. McKenney
On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> >  wrote:

[ . . . ]

> > > Don't get me wrong -- the fact that this kthread appears to have
> > > blocked within rcu_barrier() for 120 seconds means that something is
> > > most definitely wrong here.  I am surprised that there are no RCU CPU
> > > stall warnings, but perhaps the blockage is in the callback execution
> > > rather than grace-period completion.  Or something is preventing this
> > > kthread from starting up after the wake-up callback executes.  Or...
> > >
> > > Is this thing reproducible?
> > 
> > I've added Yanko on CC, who reported the backtrace above and can
> > recreate it reliably.  Apparently reverting the RCU merge commit
> > (d6dd50e) and rebuilding the latest after that does not show the
> > issue.  I'll let Yanko explain more and answer any questions you have.
> 
> - It is reproducible
> - I've done another build here to double check and its definitely the rcu 
> merge
>   that's causing it. 
> 
> Don't think I'll be able to dig deeper, but I can do testing if needed.

Please!  Does the following patch help?

Thanx, Paul



rcu: More on deadlock between CPU hotplug and expedited grace periods

Commit dd56af42bd82 (rcu: Eliminate deadlock between CPU hotplug and
expedited grace periods) was incomplete.  Although it did eliminate
deadlocks involving synchronize_sched_expedited()'s acquisition of
cpu_hotplug.lock via get_online_cpus(), it did nothing about the similar
deadlock involving acquisition of this same lock via put_online_cpus().
This deadlock became apparent with testing involving hibernation.

This commit therefore changes put_online_cpus() acquisition of this lock
to be conditional, and increments a new cpu_hotplug.puts_pending field
in case of acquisition failure.  Then cpu_hotplug_begin() checks for this
new field being non-zero, and applies any changes to cpu_hotplug.refcount.

Reported-by: Jiri Kosina 
Signed-off-by: Paul E. McKenney 
Tested-by: Jiri Kosina 

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 356450f09c1f..90a3d017b90c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -64,6 +64,8 @@ static struct {
 * an ongoing cpu hotplug operation.
 */
int refcount;
+   /* And allows lockless put_online_cpus(). */
+   atomic_t puts_pending;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
@@ -113,7 +115,11 @@ void put_online_cpus(void)
 {
if (cpu_hotplug.active_writer == current)
return;
-   mutex_lock(&cpu_hotplug.lock);
+   if (!mutex_trylock(&cpu_hotplug.lock)) {
+   atomic_inc(&cpu_hotplug.puts_pending);
+   cpuhp_lock_release();
+   return;
+   }
 
if (WARN_ON(!cpu_hotplug.refcount))
cpu_hotplug.refcount++; /* try to fix things up */
@@ -155,6 +161,12 @@ void cpu_hotplug_begin(void)
cpuhp_lock_acquire();
for (;;) {
mutex_lock(&cpu_hotplug.lock);
+   if (atomic_read(&cpu_hotplug.puts_pending)) {
+   int delta;
+
+   delta = atomic_xchg(&cpu_hotplug.puts_pending, 0);
+   cpu_hotplug.refcount -= delta;
+   }
if (likely(!cpu_hotplug.refcount))
break;
__set_current_state(TASK_UNINTERRUPTIBLE);



Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Yanko Kaneti
On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
>  wrote:
> > On Wed, Oct 22, 2014 at 01:25:37PM -0500, Eric W. Biederman wrote:
> >> "Paul E. McKenney"  writes:
> >>
> >> > On Wed, Oct 22, 2014 at 12:53:24PM -0500, Eric W. Biederman wrote:
> >> >> Cong Wang  writes:
> >> >>
> >> >> > (Adding Paul and Eric in Cc)
> >> >> >
> >> >> >
> >> >> > On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer 
> >> >> >  wrote:
> >> >> >>
> >> >> >> Someone else is seeing this when they try and modprobe ppp_generic:
> >> >> >>
> >> >> >> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 
> >> >> >> 120 seconds.
> >> >> >> [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> >> >> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >> >> disables this message.
> >> >> >> [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
> >> >> >> 0x
> >> >> >> [  240.599744] Workqueue: netns cleanup_net
> >> >> >> [  240.599823]  8802202eb9e8 0096 8802202db480
> >> >> >> 001d5f00
> >> >> >> [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
> >> >> >> 8802202db480
> >> >> >> [  240.600228]  81ee2690 7fff 81ee2698
> >> >> >> 81ee2690
> >> >> >> [  240.600386] Call Trace:
> >> >> >> [  240.600445]  [] schedule+0x29/0x70
> >> >> >> [  240.600541]  [] schedule_timeout+0x26c/0x410
> >> >> >> [  240.600651]  [] ? retint_restore_args+0x13/0x13
> >> >> >> [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
> >> >> >> [  240.600879]  [] wait_for_completion+0x10c/0x150
> >> >> >> [  240.601025]  [] ? wake_up_state+0x20/0x20
> >> >> >> [  240.601133]  [] _rcu_barrier+0x159/0x200
> >> >> >> [  240.601237]  [] rcu_barrier+0x15/0x20
> >> >> >> [  240.601335]  [] netdev_run_todo+0x6f/0x310
> >> >> >> [  240.601442]  [] ? 
> >> >> >> rollback_registered_many+0x265/0x2e0
> >> >> >> [  240.601564]  [] rtnl_unlock+0xe/0x10
> >> >> >> [  240.601660]  [] 
> >> >> >> default_device_exit_batch+0x156/0x180
> >> >> >> [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
> >> >> >> [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
> >> >> >> [  240.602028]  [] cleanup_net+0x100/0x1f0
> >> >> >> [  240.602131]  [] process_one_work+0x218/0x850
> >> >> >> [  240.602241]  [] ? process_one_work+0x17f/0x850
> >> >> >> [  240.602350]  [] ? worker_thread+0xe7/0x4a0
> >> >> >> [  240.602454]  [] worker_thread+0x6b/0x4a0
> >> >> >> [  240.602555]  [] ? process_one_work+0x850/0x850
> >> >> >> [  240.602665]  [] kthread+0x10b/0x130
> >> >> >> [  240.602762]  [] ? sched_clock+0x9/0x10
> >> >> >> [  240.602862]  [] ? 
> >> >> >> kthread_create_on_node+0x250/0x250
> >> >> >> [  240.603004]  [] ret_from_fork+0x7c/0xb0
> >> >> >> [  240.603106]  [] ? 
> >> >> >> kthread_create_on_node+0x250/0x250
> >> >> >> [  240.603224] 4 locks held by kworker/u16:5/100:
> >> >> >> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
> >> >> >> process_one_work+0x17f/0x850
> >> >> >> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
> >> >> >> [] process_one_work+0x17f/0x850
> >> >> >> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
> >> >> >> cleanup_net+0x8c/0x1f0
> >> >> >> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
> >> >> >> [] _rcu_barrier+0x35/0x200
> >> >> >> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 
> >> >> >> seconds.
> >> >> >> [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> >> >> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >> >> disables this message.
> >> >> >> [  240.604570] modprobe        D 8800cb4f1a40 13112  1387   1386 
> >> >> >> 0x0080
> >> >> >> [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
> >> >> >> 001d5f00
> >> >> >> [  240.604878]  8800cafbbfd8 001d5f00 88022328
> >> >> >> 8800cb4f1a40
> >> >> >> [  240.605068]  8800cb4f1a40 81f8fb48 0246
> >> >> >> 8800cb4f1a40
> >> >> >> [  240.605228] Call Trace:
> >> >> >> [  240.605283]  [] 
> >> >> >> schedule_preempt_disabled+0x31/0x80
> >> >> >> [  240.605400]  [] mutex_lock_nested+0x183/0x440
> >> >> >> [  240.605510]  [] ? 
> >> >> >> register_pernet_subsys+0x1f/0x50
> >> >> >> [  240.605626]  [] ? 
> >> >> >> register_pernet_subsys+0x1f/0x50
> >> >> >> [  240.605757]  [] ? 0xa0701000
> >> >> >> [  240.605854]  [] register_pernet_subsys+0x1f/0x50
> >> >> >> [  240.606005]  [] br_init+0x48/0xd3 [bridge]
> >> >> >> [  240.606112]  [] do_one_initcall+0xd8/0x210
> >> >> >> [  240.606224]  [] load_module+0x20c2/0x2870
> >> >> >> [  240.606327]  [] ? store_uevent+0x70/0x70
> >> >> >> [  240.606433]  [] ? 
> >> >> >> lock_release_non_nested+0x3c6/0x3d0
> >> >> >> [  240.606557]  [] SyS_init_module+0xe7/0x140
> >> >> >> [  240.606664]  [] system_call_fastpath+0x12/0x17
> >> >> >> [  240.606773] 1 lock held by modprobe/1387:
> >> >> 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Josh Boyer
On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
 wrote:
> On Wed, Oct 22, 2014 at 01:25:37PM -0500, Eric W. Biederman wrote:
>> "Paul E. McKenney"  writes:
>>
>> > On Wed, Oct 22, 2014 at 12:53:24PM -0500, Eric W. Biederman wrote:
>> >> Cong Wang  writes:
>> >>
>> >> > (Adding Paul and Eric in Cc)
>> >> >
>> >> >
>> >> > On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer 
>> >> >  wrote:
>> >> >>
>> >> >> Someone else is seeing this when they try and modprobe ppp_generic:
>> >> >>
>> >> >> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
>> >> >> seconds.
>> >> >> [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> >> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> >> disables this message.
>> >> >> [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
>> >> >> 0x
>> >> >> [  240.599744] Workqueue: netns cleanup_net
>> >> >> [  240.599823]  8802202eb9e8 0096 8802202db480
>> >> >> 001d5f00
>> >> >> [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
>> >> >> 8802202db480
>> >> >> [  240.600228]  81ee2690 7fff 81ee2698
>> >> >> 81ee2690
>> >> >> [  240.600386] Call Trace:
>> >> >> [  240.600445]  [] schedule+0x29/0x70
>> >> >> [  240.600541]  [] schedule_timeout+0x26c/0x410
>> >> >> [  240.600651]  [] ? retint_restore_args+0x13/0x13
>> >> >> [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
>> >> >> [  240.600879]  [] wait_for_completion+0x10c/0x150
>> >> >> [  240.601025]  [] ? wake_up_state+0x20/0x20
>> >> >> [  240.601133]  [] _rcu_barrier+0x159/0x200
>> >> >> [  240.601237]  [] rcu_barrier+0x15/0x20
>> >> >> [  240.601335]  [] netdev_run_todo+0x6f/0x310
>> >> >> [  240.601442]  [] ? 
>> >> >> rollback_registered_many+0x265/0x2e0
>> >> >> [  240.601564]  [] rtnl_unlock+0xe/0x10
>> >> >> [  240.601660]  [] 
>> >> >> default_device_exit_batch+0x156/0x180
>> >> >> [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
>> >> >> [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
>> >> >> [  240.602028]  [] cleanup_net+0x100/0x1f0
>> >> >> [  240.602131]  [] process_one_work+0x218/0x850
>> >> >> [  240.602241]  [] ? process_one_work+0x17f/0x850
>> >> >> [  240.602350]  [] ? worker_thread+0xe7/0x4a0
>> >> >> [  240.602454]  [] worker_thread+0x6b/0x4a0
>> >> >> [  240.602555]  [] ? process_one_work+0x850/0x850
>> >> >> [  240.602665]  [] kthread+0x10b/0x130
>> >> >> [  240.602762]  [] ? sched_clock+0x9/0x10
>> >> >> [  240.602862]  [] ? 
>> >> >> kthread_create_on_node+0x250/0x250
>> >> >> [  240.603004]  [] ret_from_fork+0x7c/0xb0
>> >> >> [  240.603106]  [] ? 
>> >> >> kthread_create_on_node+0x250/0x250
>> >> >> [  240.603224] 4 locks held by kworker/u16:5/100:
>> >> >> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
>> >> >> process_one_work+0x17f/0x850
>> >> >> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
>> >> >> [] process_one_work+0x17f/0x850
>> >> >> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
>> >> >> cleanup_net+0x8c/0x1f0
>> >> >> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
>> >> >> [] _rcu_barrier+0x35/0x200
>> >> >> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 
>> >> >> seconds.
>> >> >> [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> >> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> >> disables this message.
>> >> >> [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
>> >> >> 0x0080
>> >> >> [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
>> >> >> 001d5f00
>> >> >> [  240.604878]  8800cafbbfd8 001d5f00 88022328
>> >> >> 8800cb4f1a40
>> >> >> [  240.605068]  8800cb4f1a40 81f8fb48 0246
>> >> >> 8800cb4f1a40
>> >> >> [  240.605228] Call Trace:
>> >> >> [  240.605283]  [] 
>> >> >> schedule_preempt_disabled+0x31/0x80
>> >> >> [  240.605400]  [] mutex_lock_nested+0x183/0x440
>> >> >> [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
>> >> >> [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
>> >> >> [  240.605757]  [] ? 0xa0701000
>> >> >> [  240.605854]  [] register_pernet_subsys+0x1f/0x50
>> >> >> [  240.606005]  [] br_init+0x48/0xd3 [bridge]
>> >> >> [  240.606112]  [] do_one_initcall+0xd8/0x210
>> >> >> [  240.606224]  [] load_module+0x20c2/0x2870
>> >> >> [  240.606327]  [] ? store_uevent+0x70/0x70
>> >> >> [  240.606433]  [] ? 
>> >> >> lock_release_non_nested+0x3c6/0x3d0
>> >> >> [  240.606557]  [] SyS_init_module+0xe7/0x140
>> >> >> [  240.606664]  [] system_call_fastpath+0x12/0x17
>> >> >> [  240.606773] 1 lock held by modprobe/1387:
>> >> >> [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
>> >> >> register_pernet_subsys+0x1f/0x50
>> >> >> [  240.607114] INFO: task modprobe:1466 blocked for more than 120 
>> >> >> seconds.
>> >> >> [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Paul E. McKenney
On Wed, Oct 22, 2014 at 01:25:37PM -0500, Eric W. Biederman wrote:
> "Paul E. McKenney"  writes:
> 
> > On Wed, Oct 22, 2014 at 12:53:24PM -0500, Eric W. Biederman wrote:
> >> Cong Wang  writes:
> >> 
> >> > (Adding Paul and Eric in Cc)
> >> >
> >> >
> >> > On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer  
> >> > wrote:
> >> >>
> >> >> Someone else is seeing this when they try and modprobe ppp_generic:
> >> >>
> >> >> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
> >> >> seconds.
> >> >> [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> >> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >> disables this message.
> >> >> [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
> >> >> 0x
> >> >> [  240.599744] Workqueue: netns cleanup_net
> >> >> [  240.599823]  8802202eb9e8 0096 8802202db480
> >> >> 001d5f00
> >> >> [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
> >> >> 8802202db480
> >> >> [  240.600228]  81ee2690 7fff 81ee2698
> >> >> 81ee2690
> >> >> [  240.600386] Call Trace:
> >> >> [  240.600445]  [] schedule+0x29/0x70
> >> >> [  240.600541]  [] schedule_timeout+0x26c/0x410
> >> >> [  240.600651]  [] ? retint_restore_args+0x13/0x13
> >> >> [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
> >> >> [  240.600879]  [] wait_for_completion+0x10c/0x150
> >> >> [  240.601025]  [] ? wake_up_state+0x20/0x20
> >> >> [  240.601133]  [] _rcu_barrier+0x159/0x200
> >> >> [  240.601237]  [] rcu_barrier+0x15/0x20
> >> >> [  240.601335]  [] netdev_run_todo+0x6f/0x310
> >> >> [  240.601442]  [] ? 
> >> >> rollback_registered_many+0x265/0x2e0
> >> >> [  240.601564]  [] rtnl_unlock+0xe/0x10
> >> >> [  240.601660]  [] 
> >> >> default_device_exit_batch+0x156/0x180
> >> >> [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
> >> >> [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
> >> >> [  240.602028]  [] cleanup_net+0x100/0x1f0
> >> >> [  240.602131]  [] process_one_work+0x218/0x850
> >> >> [  240.602241]  [] ? process_one_work+0x17f/0x850
> >> >> [  240.602350]  [] ? worker_thread+0xe7/0x4a0
> >> >> [  240.602454]  [] worker_thread+0x6b/0x4a0
> >> >> [  240.602555]  [] ? process_one_work+0x850/0x850
> >> >> [  240.602665]  [] kthread+0x10b/0x130
> >> >> [  240.602762]  [] ? sched_clock+0x9/0x10
> >> >> [  240.602862]  [] ? 
> >> >> kthread_create_on_node+0x250/0x250
> >> >> [  240.603004]  [] ret_from_fork+0x7c/0xb0
> >> >> [  240.603106]  [] ? 
> >> >> kthread_create_on_node+0x250/0x250
> >> >> [  240.603224] 4 locks held by kworker/u16:5/100:
> >> >> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
> >> >> process_one_work+0x17f/0x850
> >> >> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
> >> >> [] process_one_work+0x17f/0x850
> >> >> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
> >> >> cleanup_net+0x8c/0x1f0
> >> >> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
> >> >> [] _rcu_barrier+0x35/0x200
> >> >> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 
> >> >> seconds.
> >> >> [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> >> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >> disables this message.
> >> >> [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
> >> >> 0x0080
> >> >> [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
> >> >> 001d5f00
> >> >> [  240.604878]  8800cafbbfd8 001d5f00 88022328
> >> >> 8800cb4f1a40
> >> >> [  240.605068]  8800cb4f1a40 81f8fb48 0246
> >> >> 8800cb4f1a40
> >> >> [  240.605228] Call Trace:
> >> >> [  240.605283]  [] schedule_preempt_disabled+0x31/0x80
> >> >> [  240.605400]  [] mutex_lock_nested+0x183/0x440
> >> >> [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
> >> >> [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
> >> >> [  240.605757]  [] ? 0xa0701000
> >> >> [  240.605854]  [] register_pernet_subsys+0x1f/0x50
> >> >> [  240.606005]  [] br_init+0x48/0xd3 [bridge]
> >> >> [  240.606112]  [] do_one_initcall+0xd8/0x210
> >> >> [  240.606224]  [] load_module+0x20c2/0x2870
> >> >> [  240.606327]  [] ? store_uevent+0x70/0x70
> >> >> [  240.606433]  [] ? 
> >> >> lock_release_non_nested+0x3c6/0x3d0
> >> >> [  240.606557]  [] SyS_init_module+0xe7/0x140
> >> >> [  240.606664]  [] system_call_fastpath+0x12/0x17
> >> >> [  240.606773] 1 lock held by modprobe/1387:
> >> >> [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
> >> >> register_pernet_subsys+0x1f/0x50
> >> >> [  240.607114] INFO: task modprobe:1466 blocked for more than 120 
> >> >> seconds.
> >> >> [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> >> [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >> disables this message.
> >> >> [  240.607473] modprobeD 88020fbab480 13096  

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Eric W. Biederman
"Paul E. McKenney"  writes:

> On Wed, Oct 22, 2014 at 12:53:24PM -0500, Eric W. Biederman wrote:
>> Cong Wang  writes:
>> 
>> > (Adding Paul and Eric in Cc)
>> >
>> >
>> > On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer  
>> > wrote:
>> >>
>> >> Someone else is seeing this when they try and modprobe ppp_generic:
>> >>
>> >> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
>> >> seconds.
>> >> [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> disables this message.
>> >> [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
>> >> 0x
>> >> [  240.599744] Workqueue: netns cleanup_net
>> >> [  240.599823]  8802202eb9e8 0096 8802202db480
>> >> 001d5f00
>> >> [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
>> >> 8802202db480
>> >> [  240.600228]  81ee2690 7fff 81ee2698
>> >> 81ee2690
>> >> [  240.600386] Call Trace:
>> >> [  240.600445]  [] schedule+0x29/0x70
>> >> [  240.600541]  [] schedule_timeout+0x26c/0x410
>> >> [  240.600651]  [] ? retint_restore_args+0x13/0x13
>> >> [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
>> >> [  240.600879]  [] wait_for_completion+0x10c/0x150
>> >> [  240.601025]  [] ? wake_up_state+0x20/0x20
>> >> [  240.601133]  [] _rcu_barrier+0x159/0x200
>> >> [  240.601237]  [] rcu_barrier+0x15/0x20
>> >> [  240.601335]  [] netdev_run_todo+0x6f/0x310
>> >> [  240.601442]  [] ? 
>> >> rollback_registered_many+0x265/0x2e0
>> >> [  240.601564]  [] rtnl_unlock+0xe/0x10
>> >> [  240.601660]  [] default_device_exit_batch+0x156/0x180
>> >> [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
>> >> [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
>> >> [  240.602028]  [] cleanup_net+0x100/0x1f0
>> >> [  240.602131]  [] process_one_work+0x218/0x850
>> >> [  240.602241]  [] ? process_one_work+0x17f/0x850
>> >> [  240.602350]  [] ? worker_thread+0xe7/0x4a0
>> >> [  240.602454]  [] worker_thread+0x6b/0x4a0
>> >> [  240.602555]  [] ? process_one_work+0x850/0x850
>> >> [  240.602665]  [] kthread+0x10b/0x130
>> >> [  240.602762]  [] ? sched_clock+0x9/0x10
>> >> [  240.602862]  [] ? kthread_create_on_node+0x250/0x250
>> >> [  240.603004]  [] ret_from_fork+0x7c/0xb0
>> >> [  240.603106]  [] ? kthread_create_on_node+0x250/0x250
>> >> [  240.603224] 4 locks held by kworker/u16:5/100:
>> >> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
>> >> process_one_work+0x17f/0x850
>> >> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
>> >> [] process_one_work+0x17f/0x850
>> >> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
>> >> cleanup_net+0x8c/0x1f0
>> >> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
>> >> [] _rcu_barrier+0x35/0x200
>> >> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
>> >> [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> disables this message.
>> >> [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
>> >> 0x0080
>> >> [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
>> >> 001d5f00
>> >> [  240.604878]  8800cafbbfd8 001d5f00 88022328
>> >> 8800cb4f1a40
>> >> [  240.605068]  8800cb4f1a40 81f8fb48 0246
>> >> 8800cb4f1a40
>> >> [  240.605228] Call Trace:
>> >> [  240.605283]  [] schedule_preempt_disabled+0x31/0x80
>> >> [  240.605400]  [] mutex_lock_nested+0x183/0x440
>> >> [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
>> >> [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
>> >> [  240.605757]  [] ? 0xa0701000
>> >> [  240.605854]  [] register_pernet_subsys+0x1f/0x50
>> >> [  240.606005]  [] br_init+0x48/0xd3 [bridge]
>> >> [  240.606112]  [] do_one_initcall+0xd8/0x210
>> >> [  240.606224]  [] load_module+0x20c2/0x2870
>> >> [  240.606327]  [] ? store_uevent+0x70/0x70
>> >> [  240.606433]  [] ? lock_release_non_nested+0x3c6/0x3d0
>> >> [  240.606557]  [] SyS_init_module+0xe7/0x140
>> >> [  240.606664]  [] system_call_fastpath+0x12/0x17
>> >> [  240.606773] 1 lock held by modprobe/1387:
>> >> [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
>> >> register_pernet_subsys+0x1f/0x50
>> >> [  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
>> >> [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> >> [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >> disables this message.
>> >> [  240.607473] modprobeD 88020fbab480 13096  1466   1399 
>> >> 0x0084
>> >> [  240.607622]  88020d1bbbe8 0096 88020fbab480
>> >> 001d5f00
>> >> [  240.607791]  88020d1bbfd8 001d5f00 81e1b580
>> >> 88020fbab480
>> >> [  240.607949]  88020fbab480 81f8fb48 0246
>> >> 88020fbab480

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Paul E. McKenney
On Wed, Oct 22, 2014 at 12:53:24PM -0500, Eric W. Biederman wrote:
> Cong Wang  writes:
> 
> > (Adding Paul and Eric in Cc)
> >
> >
> > On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer  
> > wrote:
> >>
> >> Someone else is seeing this when they try and modprobe ppp_generic:
> >>
> >> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
> >> seconds.
> >> [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
> >> 0x
> >> [  240.599744] Workqueue: netns cleanup_net
> >> [  240.599823]  8802202eb9e8 0096 8802202db480
> >> 001d5f00
> >> [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
> >> 8802202db480
> >> [  240.600228]  81ee2690 7fff 81ee2698
> >> 81ee2690
> >> [  240.600386] Call Trace:
> >> [  240.600445]  [] schedule+0x29/0x70
> >> [  240.600541]  [] schedule_timeout+0x26c/0x410
> >> [  240.600651]  [] ? retint_restore_args+0x13/0x13
> >> [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
> >> [  240.600879]  [] wait_for_completion+0x10c/0x150
> >> [  240.601025]  [] ? wake_up_state+0x20/0x20
> >> [  240.601133]  [] _rcu_barrier+0x159/0x200
> >> [  240.601237]  [] rcu_barrier+0x15/0x20
> >> [  240.601335]  [] netdev_run_todo+0x6f/0x310
> >> [  240.601442]  [] ? rollback_registered_many+0x265/0x2e0
> >> [  240.601564]  [] rtnl_unlock+0xe/0x10
> >> [  240.601660]  [] default_device_exit_batch+0x156/0x180
> >> [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
> >> [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
> >> [  240.602028]  [] cleanup_net+0x100/0x1f0
> >> [  240.602131]  [] process_one_work+0x218/0x850
> >> [  240.602241]  [] ? process_one_work+0x17f/0x850
> >> [  240.602350]  [] ? worker_thread+0xe7/0x4a0
> >> [  240.602454]  [] worker_thread+0x6b/0x4a0
> >> [  240.602555]  [] ? process_one_work+0x850/0x850
> >> [  240.602665]  [] kthread+0x10b/0x130
> >> [  240.602762]  [] ? sched_clock+0x9/0x10
> >> [  240.602862]  [] ? kthread_create_on_node+0x250/0x250
> >> [  240.603004]  [] ret_from_fork+0x7c/0xb0
> >> [  240.603106]  [] ? kthread_create_on_node+0x250/0x250
> >> [  240.603224] 4 locks held by kworker/u16:5/100:
> >> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
> >> process_one_work+0x17f/0x850
> >> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
> >> [] process_one_work+0x17f/0x850
> >> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
> >> cleanup_net+0x8c/0x1f0
> >> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
> >> [] _rcu_barrier+0x35/0x200
> >> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
> >> [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
> >> 0x0080
> >> [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
> >> 001d5f00
> >> [  240.604878]  8800cafbbfd8 001d5f00 88022328
> >> 8800cb4f1a40
> >> [  240.605068]  8800cb4f1a40 81f8fb48 0246
> >> 8800cb4f1a40
> >> [  240.605228] Call Trace:
> >> [  240.605283]  [] schedule_preempt_disabled+0x31/0x80
> >> [  240.605400]  [] mutex_lock_nested+0x183/0x440
> >> [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
> >> [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
> >> [  240.605757]  [] ? 0xa0701000
> >> [  240.605854]  [] register_pernet_subsys+0x1f/0x50
> >> [  240.606005]  [] br_init+0x48/0xd3 [bridge]
> >> [  240.606112]  [] do_one_initcall+0xd8/0x210
> >> [  240.606224]  [] load_module+0x20c2/0x2870
> >> [  240.606327]  [] ? store_uevent+0x70/0x70
> >> [  240.606433]  [] ? lock_release_non_nested+0x3c6/0x3d0
> >> [  240.606557]  [] SyS_init_module+0xe7/0x140
> >> [  240.606664]  [] system_call_fastpath+0x12/0x17
> >> [  240.606773] 1 lock held by modprobe/1387:
> >> [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
> >> register_pernet_subsys+0x1f/0x50
> >> [  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
> >> [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> >> [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [  240.607473] modprobeD 88020fbab480 13096  1466   1399 
> >> 0x0084
> >> [  240.607622]  88020d1bbbe8 0096 88020fbab480
> >> 001d5f00
> >> [  240.607791]  88020d1bbfd8 001d5f00 81e1b580
> >> 88020fbab480
> >> [  240.607949]  88020fbab480 81f8fb48 0246
> >> 88020fbab480
> >> [  240.608138] Call Trace:
> >> [  240.608193]  [] schedule_preempt_disabled+0x31/0x80
> >> [  240.608316]  [] 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Josh Boyer
On Wed, Oct 22, 2014 at 1:59 PM, Paul E. McKenney
 wrote:
> On Wed, Oct 22, 2014 at 10:37:53AM -0700, Cong Wang wrote:
>> (Adding Paul and Eric in Cc)
>>
>> I am not aware of any change in net/core/dev.c related here,
>> so I guess it's a bug in rcu_barrier().
>>
>> Thanks.
>
> Does commit 789cbbeca4e (workqueue: Add quiescent state between work items)
> and 3e28e3772 (workqueue: Use cond_resched_rcu_qs macro) help this?

I don't believe so.  The output below is from a post 3.18-rc1 kernel
(Linux v3.18-rc1-221-gc3351dfabf5c to be exact), and both of those
commits are included in that if I'm reading the git output correctly.

josh

>> On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer  
>> wrote:
>> >
>> > Someone else is seeing this when they try and modprobe ppp_generic:
>> >
>> > [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
>> > seconds.
>> > [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> > [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
>> > 0x
>> > [  240.599744] Workqueue: netns cleanup_net
>> > [  240.599823]  8802202eb9e8 0096 8802202db480
>> > 001d5f00
>> > [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
>> > 8802202db480
>> > [  240.600228]  81ee2690 7fff 81ee2698
>> > 81ee2690
>> > [  240.600386] Call Trace:
>> > [  240.600445]  [] schedule+0x29/0x70
>> > [  240.600541]  [] schedule_timeout+0x26c/0x410
>> > [  240.600651]  [] ? retint_restore_args+0x13/0x13
>> > [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
>> > [  240.600879]  [] wait_for_completion+0x10c/0x150
>> > [  240.601025]  [] ? wake_up_state+0x20/0x20
>> > [  240.601133]  [] _rcu_barrier+0x159/0x200
>> > [  240.601237]  [] rcu_barrier+0x15/0x20
>> > [  240.601335]  [] netdev_run_todo+0x6f/0x310
>> > [  240.601442]  [] ? rollback_registered_many+0x265/0x2e0
>> > [  240.601564]  [] rtnl_unlock+0xe/0x10
>> > [  240.601660]  [] default_device_exit_batch+0x156/0x180
>> > [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
>> > [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
>> > [  240.602028]  [] cleanup_net+0x100/0x1f0
>> > [  240.602131]  [] process_one_work+0x218/0x850
>> > [  240.602241]  [] ? process_one_work+0x17f/0x850
>> > [  240.602350]  [] ? worker_thread+0xe7/0x4a0
>> > [  240.602454]  [] worker_thread+0x6b/0x4a0
>> > [  240.602555]  [] ? process_one_work+0x850/0x850
>> > [  240.602665]  [] kthread+0x10b/0x130
>> > [  240.602762]  [] ? sched_clock+0x9/0x10
>> > [  240.602862]  [] ? kthread_create_on_node+0x250/0x250
>> > [  240.603004]  [] ret_from_fork+0x7c/0xb0
>> > [  240.603106]  [] ? kthread_create_on_node+0x250/0x250
>> > [  240.603224] 4 locks held by kworker/u16:5/100:
>> > [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
>> > process_one_work+0x17f/0x850
>> > [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
>> > [] process_one_work+0x17f/0x850
>> > [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
>> > cleanup_net+0x8c/0x1f0
>> > [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
>> > [] _rcu_barrier+0x35/0x200
>> > [  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
>> > [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> > [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
>> > 0x0080
>> > [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
>> > 001d5f00
>> > [  240.604878]  8800cafbbfd8 001d5f00 88022328
>> > 8800cb4f1a40
>> > [  240.605068]  8800cb4f1a40 81f8fb48 0246
>> > 8800cb4f1a40
>> > [  240.605228] Call Trace:
>> > [  240.605283]  [] schedule_preempt_disabled+0x31/0x80
>> > [  240.605400]  [] mutex_lock_nested+0x183/0x440
>> > [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
>> > [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
>> > [  240.605757]  [] ? 0xa0701000
>> > [  240.605854]  [] register_pernet_subsys+0x1f/0x50
>> > [  240.606005]  [] br_init+0x48/0xd3 [bridge]
>> > [  240.606112]  [] do_one_initcall+0xd8/0x210
>> > [  240.606224]  [] load_module+0x20c2/0x2870
>> > [  240.606327]  [] ? store_uevent+0x70/0x70
>> > [  240.606433]  [] ? lock_release_non_nested+0x3c6/0x3d0
>> > [  240.606557]  [] SyS_init_module+0xe7/0x140
>> > [  240.606664]  [] system_call_fastpath+0x12/0x17
>> > [  240.606773] 1 lock held by modprobe/1387:
>> > [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
>> > register_pernet_subsys+0x1f/0x50
>> > [  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
>> > [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
>> > [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [ 

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-22 Thread Paul E. McKenney
On Wed, Oct 22, 2014 at 10:37:53AM -0700, Cong Wang wrote:
> (Adding Paul and Eric in Cc)
> 
> I am not aware of any change in net/core/dev.c related here,
> so I guess it's a bug in rcu_barrier().
> 
> Thanks.

Do commits 789cbbeca4e (workqueue: Add quiescent state between work items)
and 3e28e3772 (workqueue: Use cond_resched_rcu_qs macro) help this?

Thanx, Paul

> On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer  
> wrote:
> >
> > Someone else is seeing this when they try and modprobe ppp_generic:
> >
> > [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 
> > seconds.
> > [  240.599338]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> > [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  240.599583] kworker/u16:5   D 8802202db480 12400   100  2 
> > 0x
> > [  240.599744] Workqueue: netns cleanup_net
> > [  240.599823]  8802202eb9e8 0096 8802202db480
> > 001d5f00
> > [  240.600066]  8802202ebfd8 001d5f00 8800368c3480
> > 8802202db480
> > [  240.600228]  81ee2690 7fff 81ee2698
> > 81ee2690
> > [  240.600386] Call Trace:
> > [  240.600445]  [] schedule+0x29/0x70
> > [  240.600541]  [] schedule_timeout+0x26c/0x410
> > [  240.600651]  [] ? retint_restore_args+0x13/0x13
> > [  240.600765]  [] ? _raw_spin_unlock_irq+0x34/0x50
> > [  240.600879]  [] wait_for_completion+0x10c/0x150
> > [  240.601025]  [] ? wake_up_state+0x20/0x20
> > [  240.601133]  [] _rcu_barrier+0x159/0x200
> > [  240.601237]  [] rcu_barrier+0x15/0x20
> > [  240.601335]  [] netdev_run_todo+0x6f/0x310
> > [  240.601442]  [] ? rollback_registered_many+0x265/0x2e0
> > [  240.601564]  [] rtnl_unlock+0xe/0x10
> > [  240.601660]  [] default_device_exit_batch+0x156/0x180
> > [  240.601781]  [] ? abort_exclusive_wait+0xb0/0xb0
> > [  240.601895]  [] ops_exit_list.isra.1+0x53/0x60
> > [  240.602028]  [] cleanup_net+0x100/0x1f0
> > [  240.602131]  [] process_one_work+0x218/0x850
> > [  240.602241]  [] ? process_one_work+0x17f/0x850
> > [  240.602350]  [] ? worker_thread+0xe7/0x4a0
> > [  240.602454]  [] worker_thread+0x6b/0x4a0
> > [  240.602555]  [] ? process_one_work+0x850/0x850
> > [  240.602665]  [] kthread+0x10b/0x130
> > [  240.602762]  [] ? sched_clock+0x9/0x10
> > [  240.602862]  [] ? kthread_create_on_node+0x250/0x250
> > [  240.603004]  [] ret_from_fork+0x7c/0xb0
> > [  240.603106]  [] ? kthread_create_on_node+0x250/0x250
> > [  240.603224] 4 locks held by kworker/u16:5/100:
> > [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: []
> > process_one_work+0x17f/0x850
> > [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
> > [] process_one_work+0x17f/0x850
> > [  240.603691]  #2:  (net_mutex){+.+.+.}, at: []
> > cleanup_net+0x8c/0x1f0
> > [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
> > [] _rcu_barrier+0x35/0x200
> > [  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
> > [  240.604329]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> > [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  240.604570] modprobeD 8800cb4f1a40 13112  1387   1386 
> > 0x0080
> > [  240.604719]  8800cafbbbe8 0096 8800cb4f1a40
> > 001d5f00
> > [  240.604878]  8800cafbbfd8 001d5f00 88022328
> > 8800cb4f1a40
> > [  240.605068]  8800cb4f1a40 81f8fb48 0246
> > 8800cb4f1a40
> > [  240.605228] Call Trace:
> > [  240.605283]  [] schedule_preempt_disabled+0x31/0x80
> > [  240.605400]  [] mutex_lock_nested+0x183/0x440
> > [  240.605510]  [] ? register_pernet_subsys+0x1f/0x50
> > [  240.605626]  [] ? register_pernet_subsys+0x1f/0x50
> > [  240.605757]  [] ? 0xa0701000
> > [  240.605854]  [] register_pernet_subsys+0x1f/0x50
> > [  240.606005]  [] br_init+0x48/0xd3 [bridge]
> > [  240.606112]  [] do_one_initcall+0xd8/0x210
> > [  240.606224]  [] load_module+0x20c2/0x2870
> > [  240.606327]  [] ? store_uevent+0x70/0x70
> > [  240.606433]  [] ? lock_release_non_nested+0x3c6/0x3d0
> > [  240.606557]  [] SyS_init_module+0xe7/0x140
> > [  240.606664]  [] system_call_fastpath+0x12/0x17
> > [  240.606773] 1 lock held by modprobe/1387:
> > [  240.606845]  #0:  (net_mutex){+.+.+.}, at: []
> > register_pernet_subsys+0x1f/0x50
> > [  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
> > [  240.607231]   Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> > [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  240.607473] modprobeD 88020fbab480 13096  1466   1399 
> > 0x0084
> > [  240.607622]  88020d1bbbe8 0096 88020fbab480
> > 001d5f00
> > [  240.607791]  88020d1bbfd8 001d5f00 81e1b580
> > 88020fbab480
> > [  240.607949]  
