Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote: From: Jarek Poplawski [EMAIL PROTECTED] Date: Thu, 11 Jan 2007 08:24:28 +0100 Yesterday I did what I should do earlier - checked this simple way, with printk, and now I have no doubts it's a bug: if you add or remove vlan

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 09:29:58AM +0100, Jarek Poplawski wrote: On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote: From: Jarek Poplawski [EMAIL PROTECTED] Date: Thu, 11 Jan 2007 08:24:28 +0100 Yesterday I did what I should do earlier - checked this simple way, with

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 09:35:26AM +0100, Jarek Poplawski wrote: On Thu, Jan 11, 2007 at 09:29:58AM +0100, Jarek Poplawski wrote: On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote: From: Jarek Poplawski [EMAIL PROTECTED] Date: Thu, 11 Jan 2007 08:24:28 +0100 Yesterday I

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 01:27:55AM -0800, David Miller wrote: From: Jarek Poplawski [EMAIL PROTECTED] Date: Thu, 11 Jan 2007 09:39:34 +0100 Sure, but is this even legal to be preempted during reading or modifying rcu list or be blocked while holding rcu protected pointer? Doesn't this

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Tue, Jan 09, 2007 at 09:10:45AM +0100, Jarek Poplawski wrote: On Mon, Jan 08, 2007 at 10:03:50AM -0800, Stephen Hemminger wrote: ... * Must be invoked with RCU read lock (no preempt) */ struct net_device *__find_vlan_dev(struct net_device *real_dev, ... But later in

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 10:04:11AM +0100, Jarek Poplawski wrote: ... It looks like you're talking about the right thing and I'm a fool again! Now I try to find why I even had to pay for this. I read again and again adequate chapters from R. Love and C. Benvenuti's books, see a lot about

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Stephen Hemminger
On Wed, 10 Jan 2007 13:50:48 +0100 Jarek Poplawski [EMAIL PROTECTED] wrote: On Wed, Jan 10, 2007 at 10:04:11AM +0100, Jarek Poplawski wrote: ... It looks like you're talking about the right thing and I'm a fool again! Now I try to find why I even had to pay for this. I read again and

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 12:01:23PM -0800, Stephen Hemminger wrote: ... Don't rely on books too heavily, they can get out of date with a simple code change. I've tried to find this in the code at the beginning and got misled by the path with PREEMPT_BKL. I think the books are necessary to get

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread David Miller
From: Jarek Poplawski [EMAIL PROTECTED] Date: Thu, 11 Jan 2007 08:24:28 +0100 Yesterday I did what I should do earlier - checked this simple way, with printk, and now I have no doubts it's a bug: if you add or remove vlan devices with vconfig, register_vlan_device and unregister_vlan_dev are

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-09 Thread Jarek Poplawski
On Mon, Jan 08, 2007 at 10:03:50AM -0800, Stephen Hemminger wrote: On Mon, 08 Jan 2007 08:57:10 -0800 Ben Greear [EMAIL PROTECTED] wrote: Jarek Poplawski wrote: On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote: ... So, I do believe this was the problem we were

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-08 Thread Ben Greear
Jarek Poplawski wrote: On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote: ... So, I do believe this was the problem we were hitting, and it seems fixed. Congratulations! But I can see one strange thing in vlan.c: /* Must be invoked with RCU read lock (no preempt) */ static

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-08 Thread Stephen Hemminger
On Mon, 08 Jan 2007 08:57:10 -0800 Ben Greear [EMAIL PROTECTED] wrote: Jarek Poplawski wrote: On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote: ... So, I do believe this was the problem we were hitting, and it seems fixed. Congratulations! But I can see one

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-07 Thread Jarek Poplawski
On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote: ... So, I do believe this was the problem we were hitting, and it seems fixed. Congratulations! But I can see one strange thing in vlan.c: /* Must be invoked with RCU read lock (no preempt) */ static struct vlan_group

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-05 Thread Herbert Xu
On Fri, Jan 05, 2007 at 07:38:44AM +0100, Jarek Poplawski wrote: I'd only suggest to change goto out; to return NULL; at the end of inetdev_init because now RCU is engaged unnecessarily. I agree. The RCU assignment should come before the out label. Can you send a patch? Thanks, -- Visit

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-05 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 09:04:29AM -0800, Ben Greear wrote: Jarek Poplawski wrote: On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote: On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote: Could you explain? I can see some inet_rtm_newaddr interrupted. For me it

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 07:29:30PM +1100, Herbert Xu wrote: On Thu, Jan 04, 2007 at 09:03:51AM +0100, Jarek Poplawski wrote: I doubt this is the right solution. It certainly could fix this particular situation but my main point was packets shouldn't get into kernel receive queues with

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Herbert Xu
On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote: Could you explain? I can see some inet_rtm_newaddr interrupted. For me it could be e.g.: after vconfig add eth0 9 ip addr add dev eth0.9 ... Whether eth0.9 is up or not does not affect this at all. The spin locks are

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote: On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote: Could you explain? I can see some inet_rtm_newaddr interrupted. For me it could be e.g.: after vconfig add eth0 9 ip addr add dev eth0.9 ... Whether

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Ben Greear
Jarek Poplawski wrote: On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote: On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote: Could you explain? I can see some inet_rtm_newaddr interrupted. For me it could be e.g.: after vconfig add eth0 9 ip addr add dev eth0.9

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED] Date: Thu, 04 Jan 2007 17:26:27 +1100 David Stevens [EMAIL PROTECTED] wrote: You're right, I don't know whether it'll fix the problem Ben saw or not, but it looks like the original code can do a receive before the in_device is fully initialized,

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 12:33:33PM -0800, David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Thu, 04 Jan 2007 17:26:27 +1100 David Stevens [EMAIL PROTECTED] wrote: You're right, I don't know whether it'll fix the problem Ben saw or not, but it looks like the original

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote: I've looked at this a little too -- it'd be nice to know who holds the write lock. If you mean mc_list_lock - probably nobody - it's not initialized (so the timers) for this in_device and rtnl mutex is preempted by irq. Actually I

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Jarek Poplawski
On Wed, Jan 03, 2007 at 09:07:11AM +0100, Jarek Poplawski wrote: On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote: I've looked at this a little too -- it'd be nice to know who holds the write lock. If you mean mc_list_lock - probably nobody - it's not initialized (so the

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear
Jarek Poplawski wrote: On Wed, Jan 03, 2007 at 09:07:11AM +0100, Jarek Poplawski wrote: On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote: I've looked at this a little too -- it'd be nice to know who holds the write lock. If you mean mc_list_lock - probably nobody -

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben & Jarek, Your analysis looks correct to me. It seems to me the problem is that we don't want the in_device to be searchable until after the initialization is done. What about moving the initialization of dev->ip_ptr in inetdev_init() to after the out label?
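
The idea in this message (do not make the in_device searchable until it is fully initialized) can be sketched in user space. This is only an illustration under stated assumptions, not the kernel code or the patch posted in the thread: the struct names, the `_sketch` functions and the `mc_lock_ready` field are hypothetical, and C11 release/acquire atomics stand in for the kernel's RCU pointer publication.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct in_device; not the kernel layout. */
struct in_dev_sketch {
    int mc_lock_ready;  /* stands in for "mc_list_lock has been initialized" */
    int mc_count;       /* stands in for the rest of the multicast state */
};

/* Hypothetical stand-in for struct net_device. */
struct net_dev_sketch {
    _Atomic(struct in_dev_sketch *) ip_ptr;  /* pointer the receive path looks up */
};

/*
 * Initialize every field first, then publish the pointer as the last step.
 * The release store keeps the field writes ordered before the pointer
 * becomes visible to readers.
 */
struct in_dev_sketch *inetdev_init_sketch(struct net_dev_sketch *dev)
{
    assert(dev != NULL);

    struct in_dev_sketch *in_dev = calloc(1, sizeof(*in_dev));
    if (in_dev == NULL)
        return NULL;             /* failure path: nothing was published */

    in_dev->mc_count = 0;
    in_dev->mc_lock_ready = 1;   /* all initialization happens here */

    atomic_store_explicit(&dev->ip_ptr, in_dev, memory_order_release);
    return in_dev;
}

/* Receive-path reader: sees either NULL or a fully initialized object. */
bool rx_can_use_in_dev(struct net_dev_sketch *dev)
{
    assert(dev != NULL);

    struct in_dev_sketch *in_dev =
        atomic_load_explicit(&dev->ip_ptr, memory_order_acquire);
    if (in_dev == NULL)
        return false;            /* device not set up yet: skip it */
    return in_dev->mc_lock_ready == 1;
}
```

With this ordering a receive that interrupts the initialization finds a NULL `ip_ptr` and can bail out, instead of finding a half-built structure.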

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben, Here's a patch that I think will fix it, assuming the receive is on the same device as the initialization. Can you try this out? +-DLS [inline for viewing, attached for applying] Signed-off-by: David L Stevens [EMAIL PROTECTED] diff

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear
David Stevens wrote: Ben, Here's a patch that I think will fix it, assuming the receive is on the same device as the initialization. Can you try this out? We are attempting to reproduce this now...as soon as we can reproduce, I'll apply this and see if that fixes the problem. This

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
OK, sounds good. By the way, I think you can probably hit it more often if you have something on the virtual network sending lots of multicast traffic while you're creating the interface. That'll increase the odds that you'll get into ip_check_mc() with a partially initialized in_dev. You can

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Herbert Xu
David Stevens [EMAIL PROTECTED] wrote: Ben, Here's a patch that I think will fix it, assuming the receive is on the same device as the initialization. Can you try this out? Hi David: Your patch makes sense on its own but I don't see the direct connection to the soft lock-up. Sure

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear
Herbert Xu wrote: David Stevens [EMAIL PROTECTED] wrote: Ben, Here's a patch that I think will fix it, assuming the receive is on the same device as the initialization. Can you try this out? Hi David: Your patch makes sense on its own but I don't see the direct connection to the

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Herbert, You're right, I don't know whether it'll fix the problem Ben saw or not, but it looks like the original code can do a receive before the in_device is fully initialized, and that, of course, is bad. If the device for ip_rcv() is not the same one we were initializing when

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben, If the ip_rcv() and the inetdev_init() are on the same interface in your stack backtrace, it's a certainty at that point that the lock value is still 0ed, because none of the initialization occurs until after it has returned from the function it interrupted to do the receive.
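
Why a lock that is still zeroed can hang a reader, as a toy model. This rests on an assumption the message does not state: that the rwlock uses a biased counter in which a positive value means "free" and zero means "write-locked", so zero-filled memory looks like a lock held by a writer that does not exist. All `toy_` names and the bias value are hypothetical, and the code is single-threaded and non-atomic on purpose; the spin is bounded so the example terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_RW_BIAS 0x01000000   /* hypothetical "unlocked" counter value */

struct toy_rwlock {
    int counter;                 /* TOY_RW_BIAS = free, 0 = write-locked */
};

/* Models freshly zeroed memory in which the lock was never initialized. */
void toy_rwlock_zero(struct toy_rwlock *l)
{
    assert(l != NULL);
    memset(l, 0, sizeof(*l));
}

/* Proper initialization: the lock starts out free. */
void toy_rwlock_init(struct toy_rwlock *l)
{
    assert(l != NULL);
    l->counter = TOY_RW_BIAS;
}

/* A reader takes one unit; it fails if the counter looks write-locked. */
bool toy_read_trylock(struct toy_rwlock *l)
{
    assert(l != NULL);
    if (l->counter <= 0)
        return false;
    l->counter--;
    return true;
}

void toy_read_unlock(struct toy_rwlock *l)
{
    assert(l != NULL);
    l->counter++;
}

/* Bounded spin: a real spinning reader would never give up. */
bool toy_read_lock_bounded(struct toy_rwlock *l, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (toy_read_trylock(l))
            return true;
    }
    return false;                /* would have spun forever */
}
```

On a zeroed lock `toy_read_lock_bounded()` fails for any spin budget, because no writer exists to release it; after `toy_rwlock_init()` the same call succeeds on the first try.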

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Herbert Xu
David Stevens [EMAIL PROTECTED] wrote: You're right, I don't know whether it'll fix the problem Ben saw or not, but it looks like the original code can do a receive before the in_device is fully initialized, and that, of course, is bad. If the device for ip_rcv() is not the same

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 08:39:09AM +0100, Jarek Poplawski wrote: ... It is hard to say what kind of bug to expect because at the same time other net_rx_action with the same vlan dev could take place on other processor and this inetdev_init could do more. Sorry! inetdev_init couldn't do more

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 09:23:02AM +0100, Jarek Poplawski wrote: On Tue, Jan 02, 2007 at 08:39:09AM +0100, Jarek Poplawski wrote: ... The main thing is the possibility of processing skb with not entirely open source dev which isn't expected (and checked) by receive functions. I think the

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread David Stevens
I've looked at this a little too -- it'd be nice to know who holds the write lock. I see ip_mc_destroy_dev() is bouncing through the lock for each multicast address, though it starts at the beginning of the list each time. I don't see a problem with it, but it'd be simpler if it acquired the

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Ben Greear
David Stevens wrote: I've looked at this a little too -- it'd be nice to know who holds the write lock. I see ip_mc_destroy_dev() is bouncing through the lock for each multicast address, though it starts at the beginning of the list each time. I don't see a problem with it, but it'd be simpler

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-01 Thread Ben Greear
I finally had time to look through the code in this backtrace in detail. I think it *could* be a race between ip_rcv and inetdev_init, but I am not certain. Other than that, I'm real low on ideas. I found a few more stack trace debugging options to enable..perhaps that will give a better

Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-01 Thread Jarek Poplawski
On Mon, Jan 01, 2007 at 09:00:05PM -0800, Ben Greear wrote: I finally had time to look through the code in this backtrace in detail. I think it *could* be a race between ip_rcv and inetdev_init, but I am not certain. Other than that, I'm real low on ideas. I found a few more stack trace