[asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)

2010-03-29 Thread James Lamanna
Hi,
I'm trying to figure out the cause of a soft lockup I experienced:

Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
[asterisk:32029]
Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk
Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0
Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
Mar 29 09:38:24 pstn1 kernel:  EFLAGS: 0286Tainted: GF
(2.6.18-128.1.10.el5 #1)
Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX:
f7fff880 EDX: c11ff9a0
Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP:
e5e10c80 DS: 007b ES: 007b
Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3:
0f911000 CR4: 06d0
Mar 29 09:38:24 pstn1 kernel:  [c05b067c] kfree_skbmem+0x8/0x61
Mar 29 09:38:24 pstn1 kernel:  [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51
Mar 29 09:38:24 pstn1 kernel:  [c05ad993] release_sock+0x44/0x91
Mar 29 09:38:24 pstn1 kernel:  [c05ea939] udp_sendmsg+0x44e/0x514
Mar 29 09:38:24 pstn1 kernel:  [c05efdec] inet_sendmsg+0x35/0x3f
Mar 29 09:38:24 pstn1 kernel:  [c05ab30c] sock_sendmsg+0xce/0xe8
Mar 29 09:38:24 pstn1 kernel:  [c043464f] autoremove_wake_function+0x0/0x2d
Mar 29 09:38:24 pstn1 kernel:  [c04ea17b] copy_from_user+0x17/0x5d
Mar 29 09:38:24 pstn1 kernel:  [c04ea3a1] copy_to_user+0x31/0x48
Mar 29 09:38:24 pstn1 kernel:  [f89ab141] zt_chan_read+0x1e0/0x20b [zaptel]
Mar 29 09:38:24 pstn1 kernel:  [c04ea195] copy_from_user+0x31/0x5d
Mar 29 09:38:24 pstn1 kernel:  [c05ac4c4] sys_sendto+0x116/0x140
Mar 29 09:38:24 pstn1 kernel:  [c0415d4f] flush_tlb_page+0x74/0x77
Mar 29 09:38:24 pstn1 kernel:  [c0461331] do_wp_page+0x3bf/0x40a
Mar 29 09:38:24 pstn1 kernel:  [c04284f1] current_fs_time+0x4a/0x55
Mar 29 09:38:24 pstn1 kernel:  [c0488f9b] touch_atime+0x60/0x91
Mar 29 09:38:24 pstn1 kernel:  [c047d9d0] pipe_readv+0x315/0x321
Mar 29 09:38:24 pstn1 kernel:  [c05acde4] sys_socketcall+0x106/0x19e
Mar 29 09:38:24 pstn1 kernel:  [c0404f17] syscall_call+0x7/0xb
Mar 29 09:38:24 pstn1 kernel:  ===


This occurred during a high load period (52 calls across 3 PRI spans).

A couple days ago I moved the interrupts for my PRI card to CPU0 from
CPU3, because CPU3 was handling everything else:
   CPU0   CPU1   CPU2   CPU3
  0:306  0  0 3684057379IO-APIC-edge  timer
  1:  0  0  0  13468IO-APIC-edge  i8042
  8:  0  0  0  3IO-APIC-edge  rtc
  9:  0  0  0  0   IO-APIC-level  acpi
 12:  0  0  0  4IO-APIC-edge  i8042
169:  0  0  0  0   IO-APIC-level  uhci_hcd:usb2
177:  0  0  0   18392593   IO-APIC-level  ata_piix
185:  0  0  0  1   IO-APIC-level  ehci_hcd:usb1
193:  0  0  0  0   IO-APIC-level  uhci_hcd:usb3
201:  0  0  0 2090021759   IO-APIC-level  eth0
209:  149621223  0  0 3534419461   IO-APIC-level  wct4xxp


(The CPU3 number for wct4xxp is not increasing any more).

What is the interrupt distribution of other people's systems?
Before I made this change I was having a problem with D-channels
dropping occasionally, so I thought it might be an interrupt/load
issue.

Thank you.

-- James

-- 
_
-- Bandwidth and Colocation Provided by http://www.api-digital.com --
New to Asterisk? Join us for a live introductory webinar every Thurs:
   http://www.asterisk.org/hello

asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users


Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)

2010-03-29 Thread Matt Watson
Dell server by any chance?

I have a similar problem with a TE220B in a Dell 1950 III server - i've seen
several other people having issues with digium cards in dell servers as
well.

I've actually done something similar to what you have done - isolated the
TE220B onto its own IRQ and set processor affinity for all the IRQs to
particular cores... so far I haven't had kernel pancs since doing this, but
its still a little too early to say if it has fixed the issue 100% or not.

--
Matt

On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote:

 Hi,
 I'm trying to figure out the cause of a soft lockup I experienced:

 Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
 [asterisk:32029]
 Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk
 Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0
 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
 Mar 29 09:38:24 pstn1 kernel:  EFLAGS: 0286Tainted: GF
 (2.6.18-128.1.10.el5 #1)
 Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX:
 f7fff880 EDX: c11ff9a0
 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP:
 e5e10c80 DS: 007b ES: 007b
 Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3:
 0f911000 CR4: 06d0
 Mar 29 09:38:24 pstn1 kernel:  [c05b067c] kfree_skbmem+0x8/0x61
 Mar 29 09:38:24 pstn1 kernel:  [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51
 Mar 29 09:38:24 pstn1 kernel:  [c05ad993] release_sock+0x44/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c05ea939] udp_sendmsg+0x44e/0x514
 Mar 29 09:38:24 pstn1 kernel:  [c05efdec] inet_sendmsg+0x35/0x3f
 Mar 29 09:38:24 pstn1 kernel:  [c05ab30c] sock_sendmsg+0xce/0xe8
 Mar 29 09:38:24 pstn1 kernel:  [c043464f]
 autoremove_wake_function+0x0/0x2d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea17b] copy_from_user+0x17/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea3a1] copy_to_user+0x31/0x48
 Mar 29 09:38:24 pstn1 kernel:  [f89ab141] zt_chan_read+0x1e0/0x20b
 [zaptel]
 Mar 29 09:38:24 pstn1 kernel:  [c04ea195] copy_from_user+0x31/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c05ac4c4] sys_sendto+0x116/0x140
 Mar 29 09:38:24 pstn1 kernel:  [c0415d4f] flush_tlb_page+0x74/0x77
 Mar 29 09:38:24 pstn1 kernel:  [c0461331] do_wp_page+0x3bf/0x40a
 Mar 29 09:38:24 pstn1 kernel:  [c04284f1] current_fs_time+0x4a/0x55
 Mar 29 09:38:24 pstn1 kernel:  [c0488f9b] touch_atime+0x60/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c047d9d0] pipe_readv+0x315/0x321
 Mar 29 09:38:24 pstn1 kernel:  [c05acde4] sys_socketcall+0x106/0x19e
 Mar 29 09:38:24 pstn1 kernel:  [c0404f17] syscall_call+0x7/0xb
 Mar 29 09:38:24 pstn1 kernel:  ===


 This occurred during a high load period (52 calls across 3 PRI spans).

 A couple days ago I moved the interrupts for my PRI card to CPU0 from
 CPU3, because CPU3 was handling everything else:
   CPU0   CPU1   CPU2   CPU3
  0:306  0  0 3684057379IO-APIC-edge  timer
  1:  0  0  0  13468IO-APIC-edge  i8042
  8:  0  0  0  3IO-APIC-edge  rtc
  9:  0  0  0  0   IO-APIC-level  acpi
  12:  0  0  0  4IO-APIC-edge  i8042
 169:  0  0  0  0   IO-APIC-level
  uhci_hcd:usb2
 177:  0  0  0   18392593   IO-APIC-level  ata_piix
 185:  0  0  0  1   IO-APIC-level
  ehci_hcd:usb1
 193:  0  0  0  0   IO-APIC-level
  uhci_hcd:usb3
 201:  0  0  0 2090021759   IO-APIC-level  eth0
 209:  149621223  0  0 3534419461   IO-APIC-level  wct4xxp


 (The CPU3 number for wct4xxp is not increasing any more).

 What is the interrupt distribution of other people's systems?
 Before I made this change I was having a problem with D-channels
 dropping occasionally, so I thought it might be an interrupt/load
 issue.

 Thank you.

 -- James

 --
 _
 -- Bandwidth and Colocation Provided by http://www.api-digital.com --
 New to Asterisk? Join us for a live introductory webinar every Thurs:
   http://www.asterisk.org/hello

 asterisk-users mailing list
 To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users

-- 
_
-- Bandwidth and Colocation Provided by http://www.api-digital.com --
New to Asterisk? Join us for a live introductory webinar every Thurs:
   http://www.asterisk.org/hello

asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users

Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)

2010-03-29 Thread James Lamanna
On Mon, Mar 29, 2010 at 8:38 PM, Matt Watson m...@mattgwatson.ca wrote:
 Dell server by any chance?
 I have a similar problem with a TE220B in a Dell 1950 III server - i've seen
 several other people having issues with digium cards in dell servers as
 well.
 I've actually done something similar to what you have done - isolated the
 TE220B onto its own IRQ and set processor affinity for all the IRQs to
 particular cores... so far I haven't had kernel pancs since doing this, but
 its still a little too early to say if it has fixed the issue 100% or not.

Interesting. It is actually a Dell SC1425 - Dual, dual-core Xeon Processors.
I'm hopefully going to be able to stress test this machine to see if I
can make it panic again with the PRI card IRQ isolated to CPU0. If so,
I'll see if it does the same thing on the other cores...

-- James

 --
 Matt

 On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote:

 Hi,
 I'm trying to figure out the cause of a soft lockup I experienced:

 Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
 [asterisk:32029]
 Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm:             asterisk
 Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0
 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
 Mar 29 09:38:24 pstn1 kernel:  EFLAGS: 0286    Tainted: GF
 (2.6.18-128.1.10.el5 #1)
 Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX:
 f7fff880 EDX: c11ff9a0
 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP:
 e5e10c80 DS: 007b ES: 007b
 Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3:
 0f911000 CR4: 06d0
 Mar 29 09:38:24 pstn1 kernel:  [c05b067c] kfree_skbmem+0x8/0x61
 Mar 29 09:38:24 pstn1 kernel:  [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51
 Mar 29 09:38:24 pstn1 kernel:  [c05ad993] release_sock+0x44/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c05ea939] udp_sendmsg+0x44e/0x514
 Mar 29 09:38:24 pstn1 kernel:  [c05efdec] inet_sendmsg+0x35/0x3f
 Mar 29 09:38:24 pstn1 kernel:  [c05ab30c] sock_sendmsg+0xce/0xe8
 Mar 29 09:38:24 pstn1 kernel:  [c043464f]
 autoremove_wake_function+0x0/0x2d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea17b] copy_from_user+0x17/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea3a1] copy_to_user+0x31/0x48
 Mar 29 09:38:24 pstn1 kernel:  [f89ab141] zt_chan_read+0x1e0/0x20b
 [zaptel]
 Mar 29 09:38:24 pstn1 kernel:  [c04ea195] copy_from_user+0x31/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c05ac4c4] sys_sendto+0x116/0x140
 Mar 29 09:38:24 pstn1 kernel:  [c0415d4f] flush_tlb_page+0x74/0x77
 Mar 29 09:38:24 pstn1 kernel:  [c0461331] do_wp_page+0x3bf/0x40a
 Mar 29 09:38:24 pstn1 kernel:  [c04284f1] current_fs_time+0x4a/0x55
 Mar 29 09:38:24 pstn1 kernel:  [c0488f9b] touch_atime+0x60/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c047d9d0] pipe_readv+0x315/0x321
 Mar 29 09:38:24 pstn1 kernel:  [c05acde4] sys_socketcall+0x106/0x19e
 Mar 29 09:38:24 pstn1 kernel:  [c0404f17] syscall_call+0x7/0xb
 Mar 29 09:38:24 pstn1 kernel:  ===


 This occurred during a high load period (52 calls across 3 PRI spans).

 A couple days ago I moved the interrupts for my PRI card to CPU0 from
 CPU3, because CPU3 was handling everything else:
           CPU0       CPU1       CPU2       CPU3
  0:        306          0          0 3684057379    IO-APIC-edge  timer
  1:          0          0          0      13468    IO-APIC-edge  i8042
  8:          0          0          0          3    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
  12:          0          0          0          4    IO-APIC-edge  i8042
 169:          0          0          0          0   IO-APIC-level
  uhci_hcd:usb2
 177:          0          0          0   18392593   IO-APIC-level  ata_piix
 185:          0          0          0          1   IO-APIC-level
  ehci_hcd:usb1
 193:          0          0          0          0   IO-APIC-level
  uhci_hcd:usb3
 201:          0          0          0 2090021759   IO-APIC-level  eth0
 209:  149621223          0          0 3534419461   IO-APIC-level  wct4xxp


 (The CPU3 number for wct4xxp is not increasing any more).

 What is the interrupt distribution of other people's systems?
 Before I made this change I was having a problem with D-channels
 dropping occasionally, so I thought it might be an interrupt/load
 issue.

 Thank you.

 -- James

 --
 _
 -- Bandwidth and Colocation Provided by http://www.api-digital.com --
 New to Asterisk? Join us for a live introductory webinar every Thurs:
               http://www.asterisk.org/hello

 asterisk-users mailing list
 To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users


 --
 _
 -- Bandwidth and Colocation Provided by http://www.api-digital.com --
 New to Asterisk? Join us for a live introductory webinar every Thurs:
               

Re: [asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)

2010-03-29 Thread James Lamanna
On Mon, Mar 29, 2010 at 9:23 PM, James Lamanna jlama...@gmail.com wrote:
 On Mon, Mar 29, 2010 at 8:38 PM, Matt Watson m...@mattgwatson.ca wrote:
 Dell server by any chance?
 I have a similar problem with a TE220B in a Dell 1950 III server - i've seen
 several other people having issues with digium cards in dell servers as
 well.
 I've actually done something similar to what you have done - isolated the
 TE220B onto its own IRQ and set processor affinity for all the IRQs to
 particular cores... so far I haven't had kernel pancs since doing this, but
 its still a little too early to say if it has fixed the issue 100% or not.

 Interesting. It is actually a Dell SC1425 - Dual, dual-core Xeon Processors.
 I'm hopefully going to be able to stress test this machine to see if I
 can make it panic again with the PRI card IRQ isolated to CPU0. If so,
 I'll see if it does the same thing on the other cores...

As a data point, I tried stress testing this box this evening. Moving
the interrupt to each core, the results did not change.
The test was as follows:
Originate() a call that goes out to the PSTN and comes back in. Both
sides used Milliwatt() to make sure audio flowed both ways.
I generated 30 calls this way (to use 60 PRI channels), however, I was
never able to simultaneously keep 60 channels alive. During the test,
there would always be a D-Channel down/up, which would drop all calls
on that PRI span.
I do not know if this is a Zaptel issue (1.4.12), PRI card issue
(TE401P first-gen), or something more subtle...

Any help would be appricated!

Thanks.

-- James


 -- James

 --
 Matt

 On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna jlama...@gmail.com wrote:

 Hi,
 I'm trying to figure out the cause of a soft lockup I experienced:

 Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
 [asterisk:32029]
 Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm:             asterisk
 Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[c046e7fe] CPU: 0
 Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
 Mar 29 09:38:24 pstn1 kernel:  EFLAGS: 0286    Tainted: GF
 (2.6.18-128.1.10.el5 #1)
 Mar 29 09:38:24 pstn1 kernel: EAX: 0029 EBX: f7ff9380 ECX:
 f7fff880 EDX: c11ff9a0
 Mar 29 09:38:24 pstn1 kernel: ESI: 0286 EDI: cffcda00 EBP:
 e5e10c80 DS: 007b ES: 007b
 Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3:
 0f911000 CR4: 06d0
 Mar 29 09:38:24 pstn1 kernel:  [c05b067c] kfree_skbmem+0x8/0x61
 Mar 29 09:38:24 pstn1 kernel:  [c05e9aaf] __udp_queue_rcv_skb+0x4a/0x51
 Mar 29 09:38:24 pstn1 kernel:  [c05ad993] release_sock+0x44/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c05ea939] udp_sendmsg+0x44e/0x514
 Mar 29 09:38:24 pstn1 kernel:  [c05efdec] inet_sendmsg+0x35/0x3f
 Mar 29 09:38:24 pstn1 kernel:  [c05ab30c] sock_sendmsg+0xce/0xe8
 Mar 29 09:38:24 pstn1 kernel:  [c043464f]
 autoremove_wake_function+0x0/0x2d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea17b] copy_from_user+0x17/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c04ea3a1] copy_to_user+0x31/0x48
 Mar 29 09:38:24 pstn1 kernel:  [f89ab141] zt_chan_read+0x1e0/0x20b
 [zaptel]
 Mar 29 09:38:24 pstn1 kernel:  [c04ea195] copy_from_user+0x31/0x5d
 Mar 29 09:38:24 pstn1 kernel:  [c05ac4c4] sys_sendto+0x116/0x140
 Mar 29 09:38:24 pstn1 kernel:  [c0415d4f] flush_tlb_page+0x74/0x77
 Mar 29 09:38:24 pstn1 kernel:  [c0461331] do_wp_page+0x3bf/0x40a
 Mar 29 09:38:24 pstn1 kernel:  [c04284f1] current_fs_time+0x4a/0x55
 Mar 29 09:38:24 pstn1 kernel:  [c0488f9b] touch_atime+0x60/0x91
 Mar 29 09:38:24 pstn1 kernel:  [c047d9d0] pipe_readv+0x315/0x321
 Mar 29 09:38:24 pstn1 kernel:  [c05acde4] sys_socketcall+0x106/0x19e
 Mar 29 09:38:24 pstn1 kernel:  [c0404f17] syscall_call+0x7/0xb
 Mar 29 09:38:24 pstn1 kernel:  ===


 This occurred during a high load period (52 calls across 3 PRI spans).

 A couple days ago I moved the interrupts for my PRI card to CPU0 from
 CPU3, because CPU3 was handling everything else:
           CPU0       CPU1       CPU2       CPU3
  0:        306          0          0 3684057379    IO-APIC-edge  timer
  1:          0          0          0      13468    IO-APIC-edge  i8042
  8:          0          0          0          3    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
  12:          0          0          0          4    IO-APIC-edge  i8042
 169:          0          0          0          0   IO-APIC-level
  uhci_hcd:usb2
 177:          0          0          0   18392593   IO-APIC-level  ata_piix
 185:          0          0          0          1   IO-APIC-level
  ehci_hcd:usb1
 193:          0          0          0          0   IO-APIC-level
  uhci_hcd:usb3
 201:          0          0          0 2090021759   IO-APIC-level  eth0
 209:  149621223          0          0 3534419461   IO-APIC-level  wct4xxp


 (The CPU3 number for wct4xxp is not increasing any more).

 What is the interrupt distribution of other people's systems?
 Before I made this change I was having a problem with