Re: em network issues
On Thu, Oct 19, 2006 at 02:18:13PM -0700, Jack Vogel wrote:
J> The engineer in our test group has installed 6.2 BETA2 and attempted via a
J> number of tests to reproduce this problem, the machine even shares the em
J> interrupt with usb, and yet so far he has been unsuccessful.

I've failed to reproduce on a system where IRQ was shared between em(4)
and fxp(4). I've put traffic on both, but failed to reproduce. Probably
shared IRQ is required, but not sufficient.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: em network issues
On Thu, Oct 19, 2006 at 12:18:16PM -0700, Jeremy Chadwick wrote:
J> > A bit more helpful, but unfortunately not much is a datapoint saying no
J> > problems April 3rd and watchdog timeouts after September 28 RELENG_6. I
J> > know, probably too vague to be of any use, but there it is.
J>
J> Someone else has already discussed the date of the commit which
J> supposedly broke this. Here is the exact post in the exact thread
J> discussing this:
J>
J> http://lists.freebsd.org/pipermail/freebsd-stable/2006-October/029094.html

Yes, this merge was awesome in its volume. However, it is suspected that
this part of the merge causes the problem:

  o significant performance improvements. the interrupt handler schedules
    work to a private taskqueue. the em_rxeof() function runs lockless.
    rev. 1.98 - 1.101 by scottl.

To check whether this is true or not, one needs to build a kernel with
em(4) static in the kernel and with the DEVICE_POLLING option. One
shouldn't turn polling(4) on for em(4), but the option must be present in
the kernel config. In this case the driver will run in interrupt-driven
mode, but with the old-style interrupt handler, which doesn't make use of
the taskqueue.

I'd appreciate it if people who are observing the problem would report
whether adding the DEVICE_POLLING option to the kernel config helps them
or not. This will help to tell whether the problem is in the change
quoted above or in the import of new versions from the vendor.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
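For anyone wanting to run the test Gleb describes, the kernel config change might look roughly like the fragment below. This is only a sketch: it assumes a custom config copied from GENERIC on i386, and the name EMTEST is made up for illustration. `device em` is already present in GENERIC, so in most cases only the `options` line is new. em(4) should not also be loaded as a module (no `if_em_load="YES"` in /boot/loader.conf), and polling should stay off on the interface itself.

```
# /usr/src/sys/i386/conf/EMTEST   (hypothetical name; a copy of GENERIC)
ident           EMTEST

device          em              # em(4) compiled statically into the kernel
options         DEVICE_POLLING  # must be present, but do not enable polling on em0
```

With the option compiled in but polling left disabled, the driver should take the older interrupt handler path rather than the taskqueue one, which is the comparison being asked for.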
Re: locked vnode / nfs... requires kill -9 in ddb
Kostik Belousov wrote at 06:57 +0300 on Oct 21, 2006:
> On Sat, Oct 21, 2006 at 08:25:00AM +0800, David Xu wrote:
> > On Thursday 19 October 2006 18:04, Kostik Belousov wrote:
> > > The nfs_reply is sleeping with the PCATCH set. The question is
> > > why SIGTSTP does not cause msleep to return with EINTR.
> >
> > I have not been tracking the thread, but if the thread is sleeping
> > with PCATCH, the SIGTSTP should cause the process to stop unless the
> > signal is masked by sigprocmask or the signal has a handler set.
> > This is correct behavior.
>
> David, as I understand the report, the following happens.
>
> The nfs mount point with intr option issued the request and waits for
> the reply. Some vnode locks are held while waiting. Code needs to catch
> the signals to abort the operation on user request. It uses msleep with
> PCATCH. The thread in question has td_locks > 0.
>
> The SIGTSTP is delivered, and the thread is stopped while holding a
> vnode lock.
>
> How should this situation be handled? Namely, how to sleep while having
> the ability to safely clean up on an attempt at stopping? Masking
> SIGTSTP is not an option, due to SIGSTOP having the same results and
> not being blockable.
>
> [Would it be right to stop the threads only on returning from kernel to
> user mode?]

David, here's the original report.
http://lists.freebsd.org/pipermail/freebsd-stable/2006-October/029755.html

Indeed, as Kostik surmised, the mount point is mounted intr.

I did not notice this problem while running with releng_6 from late June
for 3 months. Could it be this problem was introduced between then and
now?

This also just happened today on a system I just updated from 5.3 to
5.5-p8.
Re: locked vnode / nfs... requires kill -9 in ddb
On Saturday 21 October 2006 14:56, John E Hein wrote:
> David, here's the original report.
> http://lists.freebsd.org/pipermail/freebsd-stable/2006-October/029755.html
>
> Indeed, as Kostik surmised, the mount point is mounted intr.
>
> I did not notice this problem while running with releng_6 from late
> June for 3 months. Could it be this problem was introduced between
> then and now?
>
> This also just happened today on a system I just updated from 5.3 to
> 5.5-p8.

This is also RELENG_4's behavior: if PCATCH is set, tsleep will call
CURSIG(), which will suspend the current process if there is a SIGTSTP
or SIGSTOP signal.
Re: locked vnode / nfs... requires kill -9 in ddb
On Saturday 21 October 2006 11:57, Kostik Belousov wrote:
> On Sat, Oct 21, 2006 at 08:25:00AM +0800, David Xu wrote:
> > On Thursday 19 October 2006 18:04, Kostik Belousov wrote:
> > > The nfs_reply is sleeping with the PCATCH set. The question is
> > > why SIGTSTP does not cause msleep to return with EINTR.
> >
> > I have not been tracking the thread, but if the thread is sleeping
> > with PCATCH, the SIGTSTP should cause the process to stop unless the
> > signal is masked by sigprocmask or the signal has a handler set.
> > This is correct behavior.
>
> David, as I understand the report, the following happens.
>
> The nfs mount point with intr option issued the request and waits for
> the reply. Some vnode locks are held while waiting. Code needs to catch
> the signals to abort the operation on user request. It uses msleep with
> PCATCH. The thread in question has td_locks > 0.
>
> The SIGTSTP is delivered, and the thread is stopped while holding a
> vnode lock.
>
> How should this situation be handled? Namely, how to sleep while having
> the ability to safely clean up on an attempt at stopping? Masking
> SIGTSTP is not an option, due to SIGSTOP having the same results and
> not being blockable.
>
> [Would it be right to stop the threads only on returning from kernel to
> user mode?]

I know that in this case you want a signal to interrupt the thread but
don't want a job control signal to suspend it, but the PCATCH flag alone
is not enough to distinguish the two. I think we are trying to fix a
historical problem that dates back to RELENG_4 or earlier.

David Xu
Re: locked vnode / nfs... requires kill -9 in ddb
David Xu wrote at 15:10 +0800 on Oct 21, 2006:
> On Saturday 21 October 2006 14:56, John E Hein wrote:
> > David, here's the original report.
> > http://lists.freebsd.org/pipermail/freebsd-stable/2006-October/029755.html
> >
> > Indeed, as Kostik surmised, the mount point is mounted intr.
> >
> > I did not notice this problem while running with releng_6 from late
> > June for 3 months. Could it be this problem was introduced between
> > then and now?
> >
> > This also just happened today on a system I just updated from 5.3 to
> > 5.5-p8.
>
> This is also RELENG_4's behavior: if PCATCH is set, tsleep will call
> CURSIG(), which will suspend the current process if there is a SIGTSTP
> or SIGSTOP signal.

Great. Suspending the process is what I expect when I hit ctrl-z.
Hanging access to the filesystem isn't. ;)

Nor have I had this problem when running 4.11.
Re: locked vnode / nfs... requires kill -9 in ddb
On Saturday 21 October 2006 15:22, John E Hein wrote:
> David Xu wrote at 15:10 +0800 on Oct 21, 2006:
> > This is also RELENG_4's behavior: if PCATCH is set, tsleep will call
> > CURSIG(), which will suspend the current process if there is a
> > SIGTSTP or SIGSTOP signal.
>
> Great. Suspending the process is what I expect when I hit ctrl-z.
> Hanging access to the filesystem isn't. ;)
>
> Nor have I had this problem when running 4.11.

Do you know that the NFS code has not been changed since then? ;-)
Re: When will the new BCE driver in HEAD be incorporated into RELENG_6?
Bill Moran wrote:
> In response to Jason Thomson [EMAIL PROTECTED]:
> > Scott Long wrote:
> > > Conrad Burger wrote:
> > > > Hi
> > > >
> > > > It looks like there is a new version of the bce driver in HEAD.
> > > >
> > > > When will it be incorporated into Releng_6?
> > >
> > > It will be merged when someone, preferably 2-3 people, tell me that
> > > the changes in HEAD work for them. So far, no one has.
> > >
> > > Scott
> >
> > Using the driver from HEAD* in the latest RELENG_6 didn't fix our
> > problems. We could still trigger the Watchdog timeout when copying a
> > local file to an NFS mounted filesystem (UDP mount, GigE speeds).
>
> Same here, although it seemed to require a lot more effort to produce
> the problem.
>
> I see there have been additional updates to the driver in the past 10
> hours. I'll grab those and try again to see if they help any.

I'll cautiously say that the latest set of changes to bce might be
interesting for you. It now survives all of my ttcp and netblast tests
in both UDP and TCP. Please let me know if it makes things better,
worse, or no change for you.

Scott
Re: When will the new BCE driver in HEAD be incorporated into RELENG_6?
I'm not really a Broadcom developer, but I play one on TV =-)

Scott

Kevin Kramer wrote:
> Sorry, I thought that you and others were working on numerous Broadcom
> issues including incorrect recognition of the chipsets for the
> Poweredge 1950's and Precision 390. You had been responding to most of
> the threads regarding Broadcom issues.
>
> --
> Kevin Kramer
> Sr. Systems Administrator
> 512.418.5725
> Centaur Technology, Inc.
> www.centtech.com
>
> Scott Long wrote the following on 10/18/06 10:26:
> > Kevin Kramer wrote:
> > > and will it support the BCM5754 in the Precision 390?
> >
> > No idea, ask the vendor.
> >
> > Scott
Re: Hard lock on 6.1-STABLE
On 21.10.2006 at 01:05, Laurence Sanford wrote:
> The switch rarely shows any collisions unless network load is high on
> this box, then the collision light will come on nearly constantly.

Depending on what exactly the collision LED signifies on your switch,
this might indicate a problem. On a full-duplex link, there cannot be
any collisions by definition. You should make sure that your switch and
nve0 are both set to full-duplex. If your collision LED indicates
back-pressure being applied to the port, then it might be OK.

Stefan

--
Stefan Bethke [EMAIL PROTECTED] Fon +49 170 346 0140
Re: locked vnode / nfs... requires kill -9 in ddb
David Xu wrote at 15:46 +0800 on Oct 21, 2006:
> On Saturday 21 October 2006 15:22, John E Hein wrote:
> > David Xu wrote at 15:10 +0800 on Oct 21, 2006:
> > > This is also RELENG_4's behavior: if PCATCH is set, tsleep will
> > > call CURSIG(), which will suspend the current process if there is
> > > a SIGTSTP or SIGSTOP signal.
> >
> > Great. Suspending the process is what I expect when I hit ctrl-z.
> > Hanging access to the filesystem isn't. ;)
> >
> > Nor have I had this problem when running 4.11.
>
> Do you know that the NFS code has not been changed since then? ;-)

A quick glance shows me it hasn't changed much from 4.11... some large
fs changes in nfs, some kqueue stuff in kern, not much. No changes at
all, it seems, in the area we're talking about.
Re: em network issues
On Sat, Oct 21, 2006 at 12:57:52PM -0400, Kris Kennaway wrote:
K> On Sat, Oct 21, 2006 at 10:17:06AM +0400, Gleb Smirnoff wrote:
K> > On Thu, Oct 19, 2006 at 02:18:13PM -0700, Jack Vogel wrote:
K> > J> The engineer in our test group has installed 6.2 BETA2 and attempted via a
K> > J> number of tests to reproduce this problem, the machine even shares the em
K> > J> interrupt with usb, and yet so far he has been unsuccessful.
K> >
K> > I've failed to reproduce on a system where IRQ was shared between em(4)
K> > and fxp(4). I've put traffic on both, but failed to reproduce. Probably
K> > shared IRQ is required, but not sufficient.
K>
K> Note what I've said a couple of times now...blasting packets out over
K> the shared em doesn't trigger it for me either. I can trigger it by
K> fetching via FTP from a remote machine over the em.

I suppose the TCP_STREAM test from netperf is the same as an FTP fetch.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: em network issues
On Sat, Oct 21, 2006 at 01:00:08PM -0400, Mikhail Teterin wrote:
M> = I'd appreciate if people who are observing the problem will report
M> = whether adding DEVICE_POLLING option to kernel config helps them
M> = or not. This will help to tell whether the problem is in the above
M> = quote or in the import of new versions from vendor.
M>
M> I tried this yesterday -- before writing to [EMAIL PROTECTED] I saw the system component
M> of the total load being rather high and then enabled polling.
M>
M> Again, I did not wait long enough to check, whether the system will cease
M> communicating completely before enabling polling on em, but the system load
M> was shooting way up upon starting my backup procedure even when I switched to
M> DEVICE_POLLING-using kernel.

We aren't currently speaking about performance; we need to know whether
a kernel with the DEVICE_POLLING option makes the NIC work stably.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: em network issues
= I'd appreciate if people who are observing the problem will report
= whether adding DEVICE_POLLING option to kernel config helps them
= or not. This will help to tell whether the problem is in the above
= quote or in the import of new versions from vendor.

I tried this yesterday -- before writing to [EMAIL PROTECTED] I saw the system component
of the total load being rather high and then enabled polling.

Again, I did not wait long enough to check, whether the system will cease
communicating completely before enabling polling on em, but the system load
was shooting way up upon starting my backup procedure even when I switched to
DEVICE_POLLING-using kernel.

-mi
Re: em network issues
On Sat, Oct 21, 2006 at 10:17:06AM +0400, Gleb Smirnoff wrote:
> On Thu, Oct 19, 2006 at 02:18:13PM -0700, Jack Vogel wrote:
> J> The engineer in our test group has installed 6.2 BETA2 and attempted via a
> J> number of tests to reproduce this problem, the machine even shares the em
> J> interrupt with usb, and yet so far he has been unsuccessful.
>
> I've failed to reproduce on a system where IRQ was shared between em(4)
> and fxp(4). I've put traffic on both, but failed to reproduce. Probably
> shared IRQ is required, but not sufficient.

Note what I've said a couple of times now...blasting packets out over
the shared em doesn't trigger it for me either. I can trigger it by
fetching via FTP from a remote machine over the em.

Kris
Re: 5 to 6
for the record, i followed the recipe in UPDATING and it worked.

randy
Re: em network issues
On Sat, Oct 21, 2006 at 09:32:50PM +0400, Gleb Smirnoff wrote:
> On Sat, Oct 21, 2006 at 12:57:52PM -0400, Kris Kennaway wrote:
> K> On Sat, Oct 21, 2006 at 10:17:06AM +0400, Gleb Smirnoff wrote:
> K> > On Thu, Oct 19, 2006 at 02:18:13PM -0700, Jack Vogel wrote:
> K> > J> The engineer in our test group has installed 6.2 BETA2 and attempted via a
> K> > J> number of tests to reproduce this problem, the machine even shares the em
> K> > J> interrupt with usb, and yet so far he has been unsuccessful.
> K> >
> K> > I've failed to reproduce on a system where IRQ was shared between em(4)
> K> > and fxp(4). I've put traffic on both, but failed to reproduce. Probably
> K> > shared IRQ is required, but not sufficient.
> K>
> K> Note what I've said a couple of times now...blasting packets out over
> K> the shared em doesn't trigger it for me either. I can trigger it by
> K> fetching via FTP from a remote machine over the em.
>
> I suppose the TCP_STREAM test from the netperf is the same as ftp fetch.

Maybe, but why not just try fetch? ;)

FYI, I still get timeouts from fetch on a shared em/fxp irq with
DEVICE_POLLING configured. Turning on INTR_MPSAFE in the driver is the
only known fix so far for this condition.

Kris
Panic on DOH! ata_alloc_request failed!
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey all,

(I'm off list, please include me in any replies)

(Søren, please let me know if you do not want to be emailed in the
future directly! You seem to be the ATA RAID FreeBSD go-to guy.
Apologies if you did not want to be solicited).

I just received a panic w/ this in /var/log/messages (uname and dmesg at
the end of the email / book):

Oct 21 04:17:05 prometheus kernel: DOH! ata_alloc_composite failed!

The system would have been performing a weekly dump (live snapshot) at
the time. The motherboard is a SuperMicro P4SCi [1], which uses an Intel
6300ESB onboard SATA w/ RAID 0/1 support (Adaptec). I have one RAID
volume created, ar0, as a RAID 1 between two 300GB disks.

I previously have seen this error when we were importing a MySQL
database on the box w/ heavy disk I/O.

Oct 9 21:41:23 prometheus kernel: DOH! ata_alloc_request failed!
Oct 9 21:41:29 prometheus kernel: FAILURE - out of memory in ata_raid_init_request
Oct 9 21:41:29 prometheus last message repeated 2 times
Oct 9 21:41:29 prometheus kernel: g_vfs_done():ar0s1e[WRITE(offset=108675514368, length=16384)]error = 5
Oct 9 21:41:29 prometheus kernel: g_vfs_done():ar0s1e[WRITE(offset=108486344704, length=16384)]error = 5
Oct 9 21:41:29 prometheus kernel: g_vfs_done():ar0s1e[WRITE(offset=108486623232, length=16384)]error = 5
Oct 9 23:01:17 prometheus kernel: FAILURE - out of memory in ata_raid_init_request
Oct 9 23:01:17 prometheus kernel: FAILURE - out of memory in ata_raid_init_request
Oct 9 23:01:17 prometheus kernel: g_vfs_done():ar0s1e[WRITE(offset=118889250816, length=16384)]error = 5
Oct 9 23:01:17 prometheus kernel: g_vfs_done():ar0s1e[WRITE(offset=118889742336, length=16384)]error = 5

I've seen two other similar issues on the mailing lists:

http://lists.freebsd.org/pipermail/freebsd-amd64/2006-August/008770.html
http://lists.freebsd.org/pipermail/freebsd-amd64/2006-April/008047.html
http://lists.freebsd.org/pipermail/freebsd-stable/2005-November/019559.html

The first two didn't seem to have any follow-ups.

This is the only bug report with ata_alloc_composite returned in the
query:

http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/89310

I restarted the box, manually verified the array using the BIOS utils
(12 errors fixed, eek!), and had to run fsck manually due to an
UNEXPECTED SOFT UPDATE INCONSISTENCY on /dev/ar0s1e. I had unreferenced
files, too (yeah for backups). Since I didn't have a dumpon device, I
don't have a core to post.

Here's the grep of sysctl alluded to in the Nov. 2005 email above:

sysctl -a | grep ^ata_
ata_composit:196,0, 0,100, 3928
ata_request: 204,0, 0, 76, 127512

How do I monitor or fix this? 'sync' or restart every so often?

Any assistance would be appreciated!

TIA,
Jon

[1] http://www.supermicro.com/products/motherboard/P4/E7210/P4SCi.cfm

uname -a
FreeBSD prometheus.int.hursk.com 6.1-RELEASE FreeBSD 6.1-RELEASE #0: Sun May 7 04:32:43 UTC 2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386
(who believes in patches ;-)

dmesg afterboot=yes
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE #0: Sun May 7 04:32:43 UTC 2006
    [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC
ACPI APIC Table: IntelR AWRDACPI
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.80GHz (2795.24-MHz 686-class CPU)
  Origin = GenuineIntel Id = 0xf29 Stepping = 9
  Features=0xbfebfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE
  Features2=0x4400CNTX-ID,b14
  Logical CPUs per core: 2
real memory = 1072562176 (1022 MB)
avail memory = 1040633856 (992 MB)
ioapic0: Changing APIC ID to 2
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 24-47 on motherboard
kbd1 at kbdmux0
acpi0: IntelR AWRDACPI on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x408-0x40b on acpi0
cpu0: ACPI CPU on acpi0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
pcib1: ACPI PCI-PCI bridge at device 3.0 on pci0
pci1: ACPI PCI bus on pcib1
em0: Intel(R) PRO/1000 Network Connection Version - 3.2.18 port 0xb000-0xb01f mem 0xf230-0xf231 irq 18 at device 1.0 on pci1
em0: Ethernet address: 00:30:48:84:01:22
pcib2: ACPI PCI-PCI bridge at device 28.0 on pci0
pci2: ACPI PCI bus on pcib2
pcib3: PCI-PCI bridge at device 1.0 on pci2
pci3: PCI bus on pcib3
sf0: Adaptec ANA-62044 10/100BaseTX port 0xc000-0xc0ff mem 0xf200-0xf207 irq 24 at device 4.0 on pci3
miibus0: MII bus on sf0
ukphy0: Generic IEEE