Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')
On Fri, Jul 25, 2008 at 10:09:45AM +0100, Simon A. Boggis wrote:
> Hi,
>
> If not, I'm intending to move them cautiously across to a 2.6.26 to see
> what happens - unfortunately this process is a little bit slow.

Hi Simon,
what are your test results for 2.6.26?

Cheers,
Moritz

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')
Moritz Muehlenhoff wrote:
> On Fri, Jul 25, 2008 at 10:09:45AM +0100, Simon A. Boggis wrote:
>> Hi,
>>
>> If not, I'm intending to move them cautiously across to a 2.6.26 to see
>> what happens - unfortunately this process is a little bit slow.
>
> Hi Simon,
> what are your test results for 2.6.26?
>
> Cheers,
> Moritz

So far I have not had a recurrence of the bug on 8 systems running under my usual (firewall/router) workload.

Best wishes,
Simon
Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')
Hi,

just to add a little more information, I've seen this bug again on an identical set of hardware, running an identical (Debian preseed installed) copy of Debian, also on the (now previous) version of the testing kernel: 2.6.24-1-686. I've attached the log from the IPMI serial console.

One thing that I might not have made completely clear last time (sorry about this, if so) is that the 'BUG: soft lockup...' lines all relate to the bonding driver:

(previous crash)

BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6EBX: f77a7bf8 ECX: EDX: f8c6428e [c0103e5e] sysenter_past_esp+BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] [c0255e05] sys_socketcall+0x204/0x26BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] [BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] [c02BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] [c025460b] sys_sBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] [c025460b] sys_setsockopt+0xBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6d:3647] [c0255e05] sys_sockBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup -__write_lock_failed+0x9/0x1c

On the above machine 'ebr3' and 'etrA' are both bonded interfaces:

$ grep ^ /sys/class/net/{ebr3,etrA}/bonding/*
/sys/class/net/ebr3/bonding/ad_actor_key:17
/sys/class/net/ebr3/bonding/ad_aggregator:1
/sys/class/net/ebr3/bonding/ad_num_ports:2
/sys/class/net/ebr3/bonding/ad_partner_key:291
/sys/class/net/ebr3/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/ebr3/bonding/arp_interval:0
/sys/class/net/ebr3/bonding/arp_validate:none 0
/sys/class/net/ebr3/bonding/downdelay:0
Binary file /sys/class/net/ebr3/bonding/fail_over_mac matches
/sys/class/net/ebr3/bonding/lacp_rate:slow 0
/sys/class/net/ebr3/bonding/miimon:100
/sys/class/net/ebr3/bonding/mii_status:up
/sys/class/net/ebr3/bonding/mode:802.3ad 4
/sys/class/net/ebr3/bonding/slaves:etbA etbC
/sys/class/net/ebr3/bonding/updelay:0
/sys/class/net/ebr3/bonding/use_carrier:1
/sys/class/net/ebr3/bonding/xmit_hash_policy:layer2 0
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:290
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0

(current crash)

BUG: soft lockup - CPU#1 stuck +0xf/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 st/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839] [c025460bBUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft l_lock_failed+0x9/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443] [c025460b] sys_setsocBUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443] [c0135455] autBUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for f6279bf8 EBX: f6279bf8 ECX: EDX: f8d1828e

On the above machine etrA is a bonded interface:

$ grep ^ /sys/class/net/etrA/bonding/*
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:292
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:08:ca:6a:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0

I note an interesting exchange for Ubuntu, concerning Ubuntu 8.04 server with a 2.6.24 kernel:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/245779
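Because the serial console output above is interleaved, it can help to pull out only the complete 'BUG: soft lockup' reports and count them per CPU and task. The following is a minimal sketch of that idea and not something from the original report; the function name `tally_lockups` and the regular expression are illustrative assumptions, and truncated reports (for example `[ospf6EBX: ...`) are deliberately skipped rather than guessed at.

```python
import re
from collections import Counter

# Matches complete reports such as:
#   BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
# Fragments cut off by interleaved output do not match and are ignored.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!\s*\[([^\]:\s]+):(\d+)\]"
)


def tally_lockups(console_text):
    """Count complete soft-lockup reports per (cpu, task, pid)."""
    counts = Counter()
    for cpu, _seconds, task, pid in LOCKUP_RE.findall(console_text):
        counts[(int(cpu), task, int(pid))] += 1
    return counts


if __name__ == "__main__":
    # A short excerpt in the style of the "previous crash" log.
    sample = (
        "BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823] "
        "BUG: soft lockup - CPU#0 stuck for 11s! [ospf6EBX: f77a7bf8 "
        "[c0255e05] sys_socketcall+0x204/0x26BUG: soft lockup - CPU#3 "
        "stuck for 11s! [ebr3:2823] "
        "BUG: soft lockup - CPU#0 stuck for 11s! [ospf6d:3647]"
    )
    for (cpu, task, pid), n in sorted(tally_lockups(sample).items()):
        print(f"CPU#{cpu} {task}[{pid}]: {n}")
```

Run against a full console capture, a tally like this shows at a glance which task names (here the bonded interfaces `ebr3` and `etrA`, plus the OSPF daemons) appear in the lockup reports.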
Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'
On Thu, Jul 10, 2008 at 11:58:41PM +0100, Simon A. Boggis wrote:
> maximilian attems wrote:
>> On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
>>> Package: linux-image-2.6.24-1-686
>>> Version: 2.6.24-7
>>> Severity: critical
>>> Justification: breaks the whole system
>>
>> overflated severity, learn to set them.
>> one or two broken boxes doesn't mean the kernel is unusable on the whole.
>> but everybody like to play selfish oh my bug is that important.
>
> Whoa, steady, there's really no need for that! Isn't it evident that I
> took a great deal of care in compiling and submitting my report?

I understand your reaction, Simon.

Imagine that Maximilian is dealing with hundreds (if not thousands by now) of bug reports against the kernels. This should not hurt you, of course, but it did.

The thing with the severities is that they relate to Debian as a distribution, not each single instance of the system. The Debian kernel team (which max is a member of, I am just a bystander) has judged that even if a kernel breaks the whole system of _yours_, that does not make it critical to _Debian_ as a whole.

Confusing, yes.

 - Jonas

--
* Jonas Smedegaard - idealist and Internet architect
* Tel.: +45 40843136  Website: http://dr.jones.dk/

- The end is near: http://www.shibumi.org/eoti.htm
Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'
On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
> Package: linux-image-2.6.24-1-686
> Version: 2.6.24-7
> Severity: critical
> Justification: breaks the whole system

overflated severity, learn to set them.
one or two broken boxes doesn't mean the kernel is unusable on the whole.
but everybody like to play selfish oh my bug is that important.

> I have a number of dual processor xeon machines (hyperthreading cores,
> Intel SR2400 chassis), giving four logical processors thus:
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping        : 1
> cpu MHz         : 2992.689
> cache size      : 1024 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 1
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips        : 5989.95
> clflush size    : 64
>
> processor       : 1
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping        : 1
> cpu MHz         : 2992.689
> cache size      : 1024 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 1
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips        : 5985.43
> clflush size    : 64
>
> processor       : 2
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping        : 1
> cpu MHz         : 2992.689
> cache size      : 1024 KB
> physical id     : 3
> siblings        : 2
> core id         : 0
> cpu cores       : 1
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips        : 5985.49
> clflush size    : 64
>
> processor       : 3
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 3.00GHz
> stepping        : 1
> cpu MHz         : 2992.689
> cache size      : 1024 KB
> physical id     : 3
> siblings        : 2
> core id         : 0
> cpu cores       : 1
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips        : 5985.49
> clflush size    : 64
>
> The machines run debian stable with an apt-pinned debian testing (lenny)
> kernel for some newer features (mainly iptables state tracking).
>
> I'm running the machines as firewall/routers, and because of this I'm
> using LACP bonding to create two logical 2 gigabit interfaces, each
> composed of: 1 onboard plus one PCI-X e1000
>
> Today I saw one of my machines disappear off the network at 0935 - as
> it disappeared our HP 5400 switch reported an LACP error:
>
> I 07/10/08 09:35:08 00393 lacp: Port F1 is blocked - error condition
>
> The machine didn't recover over the course of 20 minutes - when I finally
> got into the serial console using the onboard IPMI management controller
> I could see that it was stuck in a loop producing the following messages.
> I wasn't able to get any kind of response from it other than this:
>
> SOL Session operational. Use ?? for help
> BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]

try out newer 2.6.26-rc9 snapshots, see trunk apt lines
- http://wiki.debian.org/DebianKernel

--
maks
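The processor listing quoted in the report describes two physical packages with hyperthreading, giving four logical processors. As a hedged illustration only (the helper names `parse_cpuinfo`, `group_by_package` and `has_hyperthreading` are made up for this sketch and are not part of any tool mentioned in the thread), the same topology can be read out of /proc/cpuinfo-style text like this:

```python
from collections import defaultdict


def parse_cpuinfo(text):
    """Return one dict of fields per logical processor."""
    cpus = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines carry no field
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpus.append({})  # each record starts with its "processor" field
        if cpus:
            cpus[-1][key] = value
    return cpus


def group_by_package(cpus):
    """Map each 'physical id' to the logical processor numbers on it."""
    packages = defaultdict(list)
    for cpu in cpus:
        packages[cpu["physical id"]].append(int(cpu["processor"]))
    return dict(packages)


def has_hyperthreading(cpu):
    """More siblings than cores in a package implies SMT is active."""
    return int(cpu["siblings"]) > int(cpu["cpu cores"])


if __name__ == "__main__":
    # Abbreviated to the topology fields; values taken from the report.
    sample = "\n\n".join(
        f"processor : {n}\nphysical id : {pkg}\n"
        f"siblings : 2\ncore id : 0\ncpu cores : 1"
        for n, pkg in [(0, 0), (1, 0), (2, 3), (3, 3)]
    )
    cpus = parse_cpuinfo(sample)
    print(group_by_package(cpus))
    print(all(has_hyperthreading(c) for c in cpus))
```

With the values from the report, processors 0 and 1 share physical id 0 and processors 2 and 3 share physical id 3, and every entry has two siblings on a single core, which is consistent with the "2*hyperthreading xeon" description in the bug title.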
Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'
maximilian attems wrote:
> On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
>> Package: linux-image-2.6.24-1-686
>> Version: 2.6.24-7
>> Severity: critical
>> Justification: breaks the whole system
>
> overflated severity, learn to set them.
> one or two broken boxes doesn't mean the kernel is unusable on the whole.
> but everybody like to play selfish oh my bug is that important.

Whoa, steady, there's really no need for that! Isn't it evident that I took a great deal of care in compiling and submitting my report?

There are a very limited number of choices for severity in reportbug and I did my best to choose the most appropriate one according to the help available. Unfortunately the help does not suggest how one ought to modify the correct level according to whether one, two or a thousand boxes are broken. The help does however say something along the lines of "if you are not sure, don't worry, the maintainer will assign the correct level for you".

Since I gather from your sparkling sarcasm that you seem to think you do know better, why not simply change the level for me if you can, or politely suggest that I do so if you cannot?

If you believe that the help given in reportbug and the Debian bug reporting web page is unclear or wrong, it would be much more constructive for you to take it up with the respective maintainers.

> try out newer 2.6.26-rc9 snapshots, see trunk apt lines
> - http://wiki.debian.org/DebianKernel

Thank you, that's a constructive suggestion and more like the kind of thing I hoped for. As I said in my report, I'm not sure from reading other problem reports whether the problem is one which has already been addressed, and I appreciate some indication that it might be. I will therefore have a look at a newer version as you suggest.

Thanks for taking the time to answer, and best wishes to you.

Simon