Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')

2008-12-04 Thread Moritz Muehlenhoff
On Fri, Jul 25, 2008 at 10:09:45AM +0100, Simon A. Boggis wrote:
 Hi,
 
 If not, I'm intending to move them cautiously across to a 2.6.26 to see
 what happens - unfortunately this process is a little bit slow.

Hi Simon,
what are your test results for 2.6.26?

Cheers,
Moritz



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')

2008-12-04 Thread Simon A. Boggis

Moritz Muehlenhoff wrote:
 On Fri, Jul 25, 2008 at 10:09:45AM +0100, Simon A. Boggis wrote:
  Hi,
  
  If not, I'm intending to move them cautiously across to a 2.6.26 to see
  what happens - unfortunately this process is a little bit slow.
 
 Hi Simon,
 what are your test results for 2.6.26?
 
 Cheers,
 Moritz

So far I have not had a recurrence of the bug on 8 systems running under
my usual (firewall/router) workload.


Best wishes,

Simon






Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')

2008-07-25 Thread Simon A. Boggis
Hi,

just to add a little more information, I've seen this bug again on an
identical set of hardware, running an identical (Debian preseed
installed) copy of Debian, also on the (now previous) version of the
testing kernel: 2.6.24-1-686. I've attached the log from the IPMI serial
console.

One thing that I might not have made completely clear last time (sorry
about this, if so) is that the 'BUG: soft lockup...' lines all relate to
the bonding driver:

(previous crash)
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6EBX: f77a7bf8 ECX:
 EDX: f8c6428e
 [c0103e5e] sysenter_past_esp+BUG: soft lockup - CPU#3 stuck for 11s!
[ebr3:2823]
 [c0255e05] sys_socketcall+0x204/0x26BUG: soft lockup - CPU#3 stuck
for 11s! [ebr3:2823]
 [BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [c02BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [c025460b] sys_sBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [c025460b] sys_setsockopt+0xBUG: soft lockup - CPU#3 stuck for 11s!
[ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6d:3647]
 [c0255e05] sys_sockBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup -__write_lock_failed+0x9/0x1c

On the above machine 'ebr3' and 'etrA' are both bonded interfaces:
$ grep ^ /sys/class/net/{ebr3,etrA}/bonding/*
/sys/class/net/ebr3/bonding/ad_actor_key:17
/sys/class/net/ebr3/bonding/ad_aggregator:1
/sys/class/net/ebr3/bonding/ad_num_ports:2
/sys/class/net/ebr3/bonding/ad_partner_key:291
/sys/class/net/ebr3/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/ebr3/bonding/arp_interval:0
/sys/class/net/ebr3/bonding/arp_validate:none 0
/sys/class/net/ebr3/bonding/downdelay:0
Binary file /sys/class/net/ebr3/bonding/fail_over_mac matches
/sys/class/net/ebr3/bonding/lacp_rate:slow 0
/sys/class/net/ebr3/bonding/miimon:100
/sys/class/net/ebr3/bonding/mii_status:up
/sys/class/net/ebr3/bonding/mode:802.3ad 4
/sys/class/net/ebr3/bonding/slaves:etbA etbC
/sys/class/net/ebr3/bonding/updelay:0
/sys/class/net/ebr3/bonding/use_carrier:1
/sys/class/net/ebr3/bonding/xmit_hash_policy:layer2 0
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:290
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0
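
As a minimal sketch (not part of the original report), the `grep ^` listing above can be turned into a per-interface dictionary, which makes it easier to compare bonds across machines. The function name `parse_bonding_dump` and the `SAMPLE` excerpt are illustrative assumptions; the sample lines themselves are copied from the listing above.

```python
from collections import defaultdict

SYSFS_PREFIX = "/sys/class/net/"


def parse_bonding_dump(text):
    """Parse `grep ^ /sys/class/net/<iface>/bonding/*` output.

    Returns {iface: {attribute: value}}. Lines that do not start with the
    sysfs prefix (for example grep's "Binary file ... matches") are skipped.
    """
    bonds = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(SYSFS_PREFIX):
            continue
        # Split on the first colon only: values such as MAC addresses
        # contain colons themselves, the sysfs path does not.
        path, _, value = line.partition(":")
        parts = path[len(SYSFS_PREFIX):].split("/")
        if len(parts) != 3 or parts[1] != "bonding":
            continue
        iface, _, attr = parts
        bonds[iface][attr] = value
    return dict(bonds)


# Excerpt of the listing above (hypothetical variable name).
SAMPLE = """\
/sys/class/net/ebr3/bonding/ad_partner_mac:00:17:a4:b3:2b:00
Binary file /sys/class/net/ebr3/bonding/fail_over_mac matches
/sys/class/net/ebr3/bonding/mode:802.3ad 4
/sys/class/net/ebr3/bonding/slaves:etbA etbC
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
"""

bonds = parse_bonding_dump(SAMPLE)
for iface, attrs in sorted(bonds.items()):
    print(iface, attrs["mode"].split()[0], attrs["slaves"].split())
```

On a live system the same function could be fed the real command output; here it only shows that both bonds run in 802.3ad mode with two slaves each.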

(current crash)
BUG: soft lockup - CPU#1 stuck +0xf/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 st/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
 [c025460bBUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft l_lock_failed+0x9/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
 [c025460b] sys_setsocBUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
 [c0135455] autBUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for f6279bf8 EBX: f6279bf8 ECX: 
EDX: f8d1828e

On the above machine etrA is a bonded interface:
$ grep ^ /sys/class/net/etrA/bonding/*
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:292
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:08:ca:6a:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0
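
To check which tasks the lockup messages name, the console excerpts above can be tallied with a short script. This is only a sketch under stated assumptions: the regular expression and the name `count_lockups` are illustrative, the sample lines are copied from the two excerpts, and messages that the interleaved console output truncated are simply not counted.

```python
import re
from collections import Counter

# Matches a complete message such as:
#   BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! ?\[([^\]\s]+):(\d+)\]"
)


def count_lockups(console_text):
    """Count complete soft-lockup messages per (cpu, task, pid)."""
    # Collapse line breaks first: some messages were wrapped so that the
    # "[task:pid]" part ended up on the following line.
    joined = " ".join(console_text.split())
    counts = Counter()
    for cpu, _seconds, task, pid in LOCKUP_RE.findall(joined):
        counts[(int(cpu), task, int(pid))] += 1
    return counts


# Lines taken from the excerpts above (hypothetical variable name).
SAMPLE_CONSOLE = """\
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [c0103e5e] sysenter_past_esp+BUG: soft lockup - CPU#3 stuck for 11s!
[ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6d:3647]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#1 st/0x1c
"""

for (cpu, task, pid), n in sorted(count_lockups(SAMPLE_CONSOLE).items()):
    print(f"CPU#{cpu} {task}[{pid}]: {n}")
```

In these excerpts the named tasks are the bond interfaces (ebr3, etrA) and the routing daemons (ospfd, ospf6d).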

I note an interesting exchange for Ubuntu, concerning Ubuntu 8.04 server
with a 2.6.24 kernel:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/245779


Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'

2008-07-11 Thread Jonas Smedegaard

On Thu, Jul 10, 2008 at 11:58:41PM +0100, Simon A. Boggis wrote:
maximilian attems wrote:
 On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
 Package: linux-image-2.6.24-1-686
 Version: 2.6.24-7
 Severity: critical
 Justification: breaks the whole system
 
 overflated severity, learn to set them.
 one or two broken boxes doesn't mean the kernel is unusable on the
 whole. but everybody like to play selfish oh my bug is that important.

Whoa, steady, there's really no need for that! Isn't it evident that I
took a great deal of care in compiling and submitting my report?

I understand your reaction, Simon.  Imagine that Maximilian is dealing
with hundreds (if not thousands by now) of bug reports against the kernels.
This should not hurt you, of course, but it did.

The thing with the severities is that they relate to Debian as a
distribution, not each single instance of the system.

The Debian kernel team (which max is a member of, I am just a bystander)
has judged that even if a kernel breaks the whole system of _yours_
that does not make it critical to _Debian_ as a whole.

Confusing, yes.


  - Jonas

-- 
* Jonas Smedegaard - idealist og Internet-arkitekt
* Tlf.: +45 40843136  Website: http://dr.jones.dk/

  - Enden er nær: http://www.shibumi.org/eoti.htm






Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'

2008-07-10 Thread maximilian attems
On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
 Package: linux-image-2.6.24-1-686
 Version: 2.6.24-7
 Severity: critical
 Justification: breaks the whole system

overflated severity, learn to set them.
one or two broken boxes doesn't mean the kernel is unusable on the
whole. but everybody like to play selfish oh my bug is that important.
 
 
 I have a number of dual processor Xeon machines (hyperthreading cores, Intel
 SR2400 chassis), giving four logical processors thus:
 
 processor : 0
 vendor_id : GenuineIntel
 cpu family: 15
 model : 4
 model name: Intel(R) Xeon(TM) CPU 3.00GHz
 stepping  : 1
 cpu MHz   : 2992.689
 cache size: 1024 KB
 physical id   : 0
 siblings  : 2
 core id   : 0
 cpu cores : 1
 fdiv_bug  : no
 hlt_bug   : no
 f00f_bug  : no
 coma_bug  : no
 fpu   : yes
 fpu_exception : yes
 cpuid level   : 5
 wp: yes
 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc 
 pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
 bogomips  : 5989.95
 clflush size  : 64
 
 processor : 1
 vendor_id : GenuineIntel
 cpu family: 15
 model : 4
 model name: Intel(R) Xeon(TM) CPU 3.00GHz
 stepping  : 1
 cpu MHz   : 2992.689
 cache size: 1024 KB
 physical id   : 0
 siblings  : 2
 core id   : 0
 cpu cores : 1
 fdiv_bug  : no
 hlt_bug   : no
 f00f_bug  : no
 coma_bug  : no
 fpu   : yes
 fpu_exception : yes
 cpuid level   : 5
 wp: yes
 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc 
 pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
 bogomips  : 5985.43
 clflush size  : 64
 
 processor : 2
 vendor_id : GenuineIntel
 cpu family: 15
 model : 4
 model name: Intel(R) Xeon(TM) CPU 3.00GHz
 stepping  : 1
 cpu MHz   : 2992.689
 cache size: 1024 KB
 physical id   : 3
 siblings  : 2
 core id   : 0
 cpu cores : 1
 fdiv_bug  : no
 hlt_bug   : no
 f00f_bug  : no
 coma_bug  : no
 fpu   : yes
 fpu_exception : yes
 cpuid level   : 5
 wp: yes
 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc 
 pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
 bogomips  : 5985.49
 clflush size  : 64
 
 processor : 3
 vendor_id : GenuineIntel
 cpu family: 15
 model : 4
 model name: Intel(R) Xeon(TM) CPU 3.00GHz
 stepping  : 1
 cpu MHz   : 2992.689
 cache size: 1024 KB
 physical id   : 3
 siblings  : 2
 core id   : 0
 cpu cores : 1
 fdiv_bug  : no
 hlt_bug   : no
 f00f_bug  : no
 coma_bug  : no
 fpu   : yes
 fpu_exception : yes
 cpuid level   : 5
 wp: yes
 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc 
 pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
 bogomips  : 5985.49
 clflush size  : 64
 
 The machines run Debian stable with an apt-pinned Debian testing (lenny)
 kernel for some newer features (mainly iptables state tracking).
 
 I'm running the machines as firewall/routers, and because of this I'm using
 LACP bonding to create two logical 2 gigabit interfaces, each composed of:
 
   1 onboard plus one PCI-X e1000
 
 Today I saw one of my machines disappear off the network at 0935 - as it
 disappeared our HP 5400 switch reported an LACP error:
   I 07/10/08 09:35:08 00393 lacp: Port F1 is blocked - error condition
 
 The machine didn't recover over the course of 20 minutes - when I finally got
 into the serial console using the onboard IPMI management controller I could
 see that it was stuck in a loop producing the following messages. I wasn't
 able to get any kind of response from it other than this:
 
 SOL Session operational.  Use ?? for help
 BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 

try out newer 2.6.26-rc9 snapshots, see trunk apt lines
- http://wiki.debian.org/DebianKernel

-- 
maks






Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'

2008-07-10 Thread Simon A. Boggis
maximilian attems wrote:
 On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
 Package: linux-image-2.6.24-1-686
 Version: 2.6.24-7
 Severity: critical
 Justification: breaks the whole system
 
 overflated severity, learn to set them.
 one or two broken boxes doesn't mean the kernel is unusable on the
 whole. but everybody like to play selfish oh my bug is that important.

Whoa, steady, there's really no need for that! Isn't it evident that I
took a great deal of care in compiling and submitting my report?

There are a very limited number of choices for severity in reportbug and
I did my best to choose the most appropriate one according to the help
available. Unfortunately the help does not suggest how one ought to
modify the correct level according to whether one, two or a thousand
boxes are broken. The help does however say something along the lines of
"if you are not sure, don't worry, the maintainer will assign the
correct level for you". Since I gather from your sparkling sarcasm that
you seem to think you do know better, why not simply change the level
for me if you can, or politely suggest that I do so if you cannot?

If you believe that the help given in reportbug and the Debian bug
reporting web page is unclear or wrong, it would be much more
constructive for you to take it up with the respective maintainers.

 
 try out newer 2.6.26-rc9 snapshots, see trunk apt lines
 - http://wiki.debian.org/DebianKernel
 

Thank you, that's a constructive suggestion and more like the kind of
thing I hoped for.

As I said in my report, I'm not sure from reading other problem reports
whether the problem is one which has already been addressed, and I
appreciate some indication that it might be. I will therefore have a
look at a newer version as you suggest.

Thanks for taking the time to answer, and best wishes to you.

Simon


