[asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
Hi, I'm trying to figure out the cause of a soft lockup I experienced: Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029] Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c Mar 29 09:38:24 pstn1 kernel: EFLAGS: 0286Tainted: GF (2.6.18-128.1.10.el5 #1) Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 06d0 Mar 29 09:38:24 pstn1 kernel: [c05b067c] kfree_skbmem+0x8/0x61 Mar 29 09:38:24 pstn1 kernel: [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51 Mar 29 09:38:24 pstn1 kernel: [c05ad993] release_sock+0x44/0x91 Mar 29 09:38:24 pstn1 kernel: [c05ea939] udp_sendmsg+0x44e/0x514 Mar 29 09:38:24 pstn1 kernel: [c05efdec] inet_sendmsg+0x35/0x3f Mar 29 09:38:24 pstn1 kernel: [c05ab30c] sock_sendmsg+0xce/0xe8 Mar 29 09:38:24 pstn1 kernel: [c043464f] autoremove_wake_function+0x0/0x2d Mar 29 09:38:24 pstn1 kernel: [c04ea17b] copy_from_user+0x17/0x5d Mar 29 09:38:24 pstn1 kernel: [c04ea3a1] copy_to_user+0x31/0x48 Mar 29 09:38:24 pstn1 kernel: [f89ab141] zt_chan_read+0x1e0/0x20b [zaptel] Mar 29 09:38:24 pstn1 kernel: [c04ea195] copy_from_user+0x31/0x5d Mar 29 09:38:24 pstn1 kernel: [c05ac4c4] sys_sendto+0x116/0x140 Mar 29 09:38:24 pstn1 kernel: [c0415d4f] flush_tlb_page+0x74/0x77 Mar 29 09:38:24 pstn1 kernel: [c0461331] do_wp_page+0x3bf/0x40a Mar 29 09:38:24 pstn1 kernel: [c04284f1] current_fs_time+0x4a/0x55 Mar 29 09:38:24 pstn1 kernel: [c0488f9b] touch_atime+0x60/0x91 Mar 29 09:38:24 pstn1 kernel: [c047d9d0] pipe_readv+0x315/0x321 Mar 29 09:38:24 pstn1 kernel: [c05acde4] sys_socketcall+0x106/0x19e Mar 29 09:38:24 pstn1 kernel: [c0404f17] syscall_call+0x7/0xb Mar 29 09:38:24 pstn1 kernel: === This occurred during a high load period (52 calls across 3 PRI spans). A couple days ago I moved the interrupts for my PRI card to CPU0 from CPU3, because CPU3 was handling everything else: CPU0 CPU1 CPU2 CPU3 0:306 0 0 3684057379IO-APIC-edge timer 1: 0 0 0 13468IO-APIC-edge i8042 8: 0 0 0 3IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-level acpi 12: 0 0 0 4IO-APIC-edge i8042 169: 0 0 0 0 IO-APIC-level uhci_hcd:usb2 177: 0 0 0 18392593 IO-APIC-level ata_piix 185: 0 0 0 1 IO-APIC-level ehci_hcd:usb1 193: 0 0 0 0 IO-APIC-level uhci_hcd:usb3 201: 0 0 0 2090021759 IO-APIC-level eth0 209: 149621223 0 0 3534419461 IO-APIC-level wct4xxp (The CPU3 number for wct4xxp is not increasing any more). What is the interrupt distribution of other people's systems? Before I made this change I was having a problem with D-channels dropping occasionally, so I thought it might be an interrupt/load issue. Thank you. -- James -- _ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- New to Asterisk? Join us for a live introductory webinar every Thurs: http://www.asterisk.org/hello asterisk-users mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-users
Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
Dell server by any chance? I have a similar problem with a TE220B in a Dell 1950 III server - i've seen several other people having issues with digium cards in dell servers as well. I've actually done something similar to what you have done - isolated the TE220B onto its own IRQ and set processor affinity for all the IRQs to particular cores... so far I haven't had kernel pancs since doing this, but its still a little too early to say if it has fixed the issue 100% or not. -- Matt On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote: Hi, I'm trying to figure out the cause of a soft lockup I experienced: Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029] Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c Mar 29 09:38:24 pstn1 kernel: EFLAGS: 0286Tainted: GF (2.6.18-128.1.10.el5 #1) Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 06d0 Mar 29 09:38:24 pstn1 kernel: [c05b067c] kfree_skbmem+0x8/0x61 Mar 29 09:38:24 pstn1 kernel: [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51 Mar 29 09:38:24 pstn1 kernel: [c05ad993] release_sock+0x44/0x91 Mar 29 09:38:24 pstn1 kernel: [c05ea939] udp_sendmsg+0x44e/0x514 Mar 29 09:38:24 pstn1 kernel: [c05efdec] inet_sendmsg+0x35/0x3f Mar 29 09:38:24 pstn1 kernel: [c05ab30c] sock_sendmsg+0xce/0xe8 Mar 29 09:38:24 pstn1 kernel: [c043464f] autoremove_wake_function+0x0/0x2d Mar 29 09:38:24 pstn1 kernel: [c04ea17b] copy_from_user+0x17/0x5d Mar 29 09:38:24 pstn1 kernel: [c04ea3a1] copy_to_user+0x31/0x48 Mar 29 09:38:24 pstn1 kernel: [f89ab141] zt_chan_read+0x1e0/0x20b [zaptel] Mar 29 09:38:24 pstn1 kernel: [c04ea195] copy_from_user+0x31/0x5d Mar 29 09:38:24 pstn1 kernel: [c05ac4c4] sys_sendto+0x116/0x140 Mar 29 09:38:24 pstn1 kernel: [c0415d4f] flush_tlb_page+0x74/0x77 Mar 29 09:38:24 pstn1 kernel: [c0461331] do_wp_page+0x3bf/0x40a Mar 29 09:38:24 pstn1 kernel: [c04284f1] current_fs_time+0x4a/0x55 Mar 29 09:38:24 pstn1 kernel: [c0488f9b] touch_atime+0x60/0x91 Mar 29 09:38:24 pstn1 kernel: [c047d9d0] pipe_readv+0x315/0x321 Mar 29 09:38:24 pstn1 kernel: [c05acde4] sys_socketcall+0x106/0x19e Mar 29 09:38:24 pstn1 kernel: [c0404f17] syscall_call+0x7/0xb Mar 29 09:38:24 pstn1 kernel: === This occurred during a high load period (52 calls across 3 PRI spans). A couple days ago I moved the interrupts for my PRI card to CPU0 from CPU3, because CPU3 was handling everything else: CPU0 CPU1 CPU2 CPU3 0:306 0 0 3684057379IO-APIC-edge timer 1: 0 0 0 13468IO-APIC-edge i8042 8: 0 0 0 3IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-level acpi 12: 0 0 0 4IO-APIC-edge i8042 169: 0 0 0 0 IO-APIC-level uhci_hcd:usb2 177: 0 0 0 18392593 IO-APIC-level ata_piix 185: 0 0 0 1 IO-APIC-level ehci_hcd:usb1 193: 0 0 0 0 IO-APIC-level uhci_hcd:usb3 201: 0 0 0 2090021759 IO-APIC-level eth0 209: 149621223 0 0 3534419461 IO-APIC-level wct4xxp (The CPU3 number for wct4xxp is not increasing any more). What is the interrupt distribution of other people's systems? Before I made this change I was having a problem with D-channels dropping occasionally, so I thought it might be an interrupt/load issue. Thank you. -- James -- _ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- New to Asterisk? Join us for a live introductory webinar every Thurs: http://www.asterisk.org/hello asterisk-users mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-users -- _ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- New to Asterisk? Join us for a live introductory webinar every Thurs: http://www.asterisk.org/hello asterisk-users mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-users
Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
On Mon, Mar 29, 2010 at 8:38 PM, Matt Watson m...@mattgwatson.ca wrote: Dell server by any chance? I have a similar problem with a TE220B in a Dell 1950 III server - i've seen several other people having issues with digium cards in dell servers as well. I've actually done something similar to what you have done - isolated the TE220B onto its own IRQ and set processor affinity for all the IRQs to particular cores... so far I haven't had kernel pancs since doing this, but its still a little too early to say if it has fixed the issue 100% or not. Interesting. It is actually a Dell SC1425 - Dual, dual-core Xeon Processors. I'm hopefully going to be able to stress test this machine to see if I can make it panic again with the PRI card IRQ isolated to CPU0. If so, I'll see if it does the same thing on the other cores... -- James -- Matt On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote: Hi, I'm trying to figure out the cause of a soft lockup I experienced: Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029] Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c Mar 29 09:38:24 pstn1 kernel: EFLAGS: 0286 Tainted: GF (2.6.18-128.1.10.el5 #1) Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 06d0 Mar 29 09:38:24 pstn1 kernel: [c05b067c] kfree_skbmem+0x8/0x61 Mar 29 09:38:24 pstn1 kernel: [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51 Mar 29 09:38:24 pstn1 kernel: [c05ad993] release_sock+0x44/0x91 Mar 29 09:38:24 pstn1 kernel: [c05ea939] udp_sendmsg+0x44e/0x514 Mar 29 09:38:24 pstn1 kernel: [c05efdec] inet_sendmsg+0x35/0x3f Mar 29 09:38:24 pstn1 kernel: [c05ab30c] sock_sendmsg+0xce/0xe8 Mar 29 09:38:24 pstn1 kernel: [c043464f] autoremove_wake_function+0x0/0x2d Mar 29 09:38:24 pstn1 kernel: [c04ea17b] copy_from_user+0x17/0x5d Mar 29 09:38:24 pstn1 kernel: [c04ea3a1] copy_to_user+0x31/0x48 Mar 29 09:38:24 pstn1 kernel: [f89ab141] zt_chan_read+0x1e0/0x20b [zaptel] Mar 29 09:38:24 pstn1 kernel: [c04ea195] copy_from_user+0x31/0x5d Mar 29 09:38:24 pstn1 kernel: [c05ac4c4] sys_sendto+0x116/0x140 Mar 29 09:38:24 pstn1 kernel: [c0415d4f] flush_tlb_page+0x74/0x77 Mar 29 09:38:24 pstn1 kernel: [c0461331] do_wp_page+0x3bf/0x40a Mar 29 09:38:24 pstn1 kernel: [c04284f1] current_fs_time+0x4a/0x55 Mar 29 09:38:24 pstn1 kernel: [c0488f9b] touch_atime+0x60/0x91 Mar 29 09:38:24 pstn1 kernel: [c047d9d0] pipe_readv+0x315/0x321 Mar 29 09:38:24 pstn1 kernel: [c05acde4] sys_socketcall+0x106/0x19e Mar 29 09:38:24 pstn1 kernel: [c0404f17] syscall_call+0x7/0xb Mar 29 09:38:24 pstn1 kernel: === This occurred during a high load period (52 calls across 3 PRI spans). A couple days ago I moved the interrupts for my PRI card to CPU0 from CPU3, because CPU3 was handling everything else: CPU0 CPU1 CPU2 CPU3 0: 306 0 0 3684057379 IO-APIC-edge timer 1: 0 0 0 13468 IO-APIC-edge i8042 8: 0 0 0 3 IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-level acpi 12: 0 0 0 4 IO-APIC-edge i8042 169: 0 0 0 0 IO-APIC-level uhci_hcd:usb2 177: 0 0 0 18392593 IO-APIC-level ata_piix 185: 0 0 0 1 IO-APIC-level ehci_hcd:usb1 193: 0 0 0 0 IO-APIC-level uhci_hcd:usb3 201: 0 0 0 2090021759 IO-APIC-level eth0 209: 149621223 0 0 3534419461 IO-APIC-level wct4xxp (The CPU3 number for wct4xxp is not increasing any more). What is the interrupt distribution of other people's systems? Before I made this change I was having a problem with D-channels dropping occasionally, so I thought it might be an interrupt/load issue. Thank you. -- James -- _ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- New to Asterisk? Join us for a live introductory webinar every Thurs: http://www.asterisk.org/hello asterisk-users mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-users -- _ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- New to Asterisk? Join us for a live introductory webinar every Thurs:
Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
On Mon, Mar 29, 2010 at 9:23 PM, James Lamanna jlama...@gmail.com wrote: On Mon, Mar 29, 2010 at 8:38 PM, Matt Watson m...@mattgwatson.ca wrote: Dell server by any chance? I have a similar problem with a TE220B in a Dell 1950 III server - i've seen several other people having issues with digium cards in dell servers as well. I've actually done something similar to what you have done - isolated the TE220B onto its own IRQ and set processor affinity for all the IRQs to particular cores... so far I haven't had kernel pancs since doing this, but its still a little too early to say if it has fixed the issue 100% or not. Interesting. It is actually a Dell SC1425 - Dual, dual-core Xeon Processors. I'm hopefully going to be able to stress test this machine to see if I can make it panic again with the PRI card IRQ isolated to CPU0. If so, I'll see if it does the same thing on the other cores... As a data point, I tried stress testing this box this evening. Moving the interrupt to each core, the results did not change. The test was as follows: Originate() a call that goes out to the PSTN and comes back in. Both sides used Milliwatt() to make sure audio flowed both ways. I generated 30 calls this way (to use 60 PRI channels), however, I was never able to simultaneously keep 60 channels alive. During the test, there would always be a D-Channel down/up, which would drop all calls on that PRI span. I do not know if this is a Zaptel issue (1.4.12), PRI card issue (TE401P first-gen), or something more subtle... Any help would be appricated! Thanks. -- James -- James -- Matt On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote: Hi, I'm trying to figure out the cause of a soft lockup I experienced: Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029] Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c Mar 29 09:38:24 pstn1 kernel: EFLAGS: 0286 Tainted: GF (2.6.18-128.1.10.el5 #1) Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 06d0 Mar 29 09:38:24 pstn1 kernel: [c05b067c] kfree_skbmem+0x8/0x61 Mar 29 09:38:24 pstn1 kernel: [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51 Mar 29 09:38:24 pstn1 kernel: [c05ad993] release_sock+0x44/0x91 Mar 29 09:38:24 pstn1 kernel: [c05ea939] udp_sendmsg+0x44e/0x514 Mar 29 09:38:24 pstn1 kernel: [c05efdec] inet_sendmsg+0x35/0x3f Mar 29 09:38:24 pstn1 kernel: [c05ab30c] sock_sendmsg+0xce/0xe8 Mar 29 09:38:24 pstn1 kernel: [c043464f] autoremove_wake_function+0x0/0x2d Mar 29 09:38:24 pstn1 kernel: [c04ea17b] copy_from_user+0x17/0x5d Mar 29 09:38:24 pstn1 kernel: [c04ea3a1] copy_to_user+0x31/0x48 Mar 29 09:38:24 pstn1 kernel: [f89ab141] zt_chan_read+0x1e0/0x20b [zaptel] Mar 29 09:38:24 pstn1 kernel: [c04ea195] copy_from_user+0x31/0x5d Mar 29 09:38:24 pstn1 kernel: [c05ac4c4] sys_sendto+0x116/0x140 Mar 29 09:38:24 pstn1 kernel: [c0415d4f] flush_tlb_page+0x74/0x77 Mar 29 09:38:24 pstn1 kernel: [c0461331] do_wp_page+0x3bf/0x40a Mar 29 09:38:24 pstn1 kernel: [c04284f1] current_fs_time+0x4a/0x55 Mar 29 09:38:24 pstn1 kernel: [c0488f9b] touch_atime+0x60/0x91 Mar 29 09:38:24 pstn1 kernel: [c047d9d0] pipe_readv+0x315/0x321 Mar 29 09:38:24 pstn1 kernel: [c05acde4] sys_socketcall+0x106/0x19e Mar 29 09:38:24 pstn1 kernel: [c0404f17] syscall_call+0x7/0xb Mar 29 09:38:24 pstn1 kernel: === This occurred during a high load period (52 calls across 3 PRI spans). A couple days ago I moved the interrupts for my PRI card to CPU0 from CPU3, because CPU3 was handling everything else: CPU0 CPU1 CPU2 CPU3 0: 306 0 0 3684057379 IO-APIC-edge timer 1: 0 0 0 13468 IO-APIC-edge i8042 8: 0 0 0 3 IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-level acpi 12: 0 0 0 4 IO-APIC-edge i8042 169: 0 0 0 0 IO-APIC-level uhci_hcd:usb2 177: 0 0 0 18392593 IO-APIC-level ata_piix 185: 0 0 0 1 IO-APIC-level ehci_hcd:usb1 193: 0 0 0 0 IO-APIC-level uhci_hcd:usb3 201: 0 0 0 2090021759 IO-APIC-level eth0 209: 149621223 0 0 3534419461 IO-APIC-level wct4xxp (The CPU3 number for wct4xxp is not increasing any more). What is the interrupt distribution of other people's systems? Before I made this change I was having a problem with