Re: twa: Passthru request timed out! Resetting controller...
Mark Dotson said the following on 11/14/06 1:18 PM:

> I've had continued problems with the 3ware series SATA cards and the Tyan boards. Specifically, I have a Tyan S5360-1U and both a 9500S-4LP and an 8506 series 3ware card. In my case the first error is different, but the 'resetting' over and over is VERY familiar. It could be triggered by a simple file copy from one part of a container to another, which degraded the unit and set off the resetting crap. Note that the drives are fine; I tested that first thing.
>
> Sep 8 11:59:23 localhost kernel: 3w-9xxx: scsi0: WARNING: (0x06:0x002C): Unit #1: Command (0x2a) timed out, resetting card.
> Sep 8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronized after power fail:unit=0.
> Sep 8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronized after power fail:unit=1.
>
> I also found this problem to exist across platforms, not just FreeBSD. For example, the excerpt above is from a CentOS box. All tests were done with the newest firmware for both card and mobo, and using the newest drivers provided by 3ware. Once I removed the card and drives from the Tyan system and stuck them in pretty much ANY other system, they worked fantastically.
>
> I don't have an answer for the resetting problem as of yet... 3ware and Tyan (and my system vendor Appro) are still trying to find my specific problem and solve it. I believe they are currently doing the replace-everything method of troubleshooting.

Mark, thank you. It's good to know that the resetting problem exists on other platforms too.

We already found out that replacing the entire box with an identical one doesn't help, so unfortunately we'll have to start replacing components with different brands or models. I wouldn't like to touch the I/O subsystem (these are already loaded production machines), so like you said, the safest bet would be to try another motherboard.
However, I don't see many dual Opteron boards in 3ware's compatibility list. The next one that comes to mind from that list is the Supermicro H8DC8, but it looks more like a gamer's dream (high-end PCI-e graphics, SLI, etc., but no on-board VGA) than a server board.

I'm quite surprised that the top Opteron motherboard manufacturer in the 3ware motherboard compatibility list:
http://3ware.com/products/pdf/Motherboard_compatibility_list_9550SX_2006_06.pdf
makes 2 of the 5 boards marked as compatible, yet they perform so badly with 3ware cards.

I know what happens on this mailing list when somebody asks for good SATA cards (Re: 3ware, 3ware, ...); I have replied that way myself. So are there any success stories with the 3ware 9550SX (SATA II) and dual AMD Opteron server boards, or is it time to go back to Intel?

Regards,
Atanas

> Atanas wrote:
>> Has anyone experienced this:
>>
>> twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20
>> twa0: INFO: (0x16: 0x1108): Resetting controller...:
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
>> ...
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
>> twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
>> twa0: INFO: (0x16: 0x1107): Controller reset done!:
>>
>> This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on a number of machines with the following hardware configuration:
>>
>> - Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
>> - 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives (configured as 8 SINGLE DISK units, aka JBOD)
>>
>> All hardware components, including the server chassis, are listed in the 3ware hardware compatibility lists. It doesn't seem to be a cabling or power issue. The controller and hard drives are already flashed to the latest firmware revisions.
>>
>> I tried turning off NCQ, but it didn't make any difference. I also tried switching the kernel from PAE to non-PAE (reducing the usable memory to 3GB), but it didn't help either.
>>
>> I have other machines with similar I/O configurations (3ware), but with Intel motherboards and running FreeBSD-5.5, and these have been running fine for about a year. Now I'm thinking about swapping the drives between a working Intel box and an AMD box, to see where the controller timeouts follow.
>>
>> The problem happens sporadically, about once a month, and is very hard to reproduce. Sometimes it takes several weeks until the next crash happens, sometimes it crashes again in just a few hours.
>>
>> When it happens, the kernel sometimes panics (most likely due to the inconsistent filesystem state caused by the controller reset) and sometimes just hangs. It can be interrupted (I have a serial console), but the only usable thing after that seems to be to call cpu_reset(), followed by a full (and sometimes painfully long) filesystem check.
>>
>> Here are the diffs against the default GENERIC and PAE kernel configurations:
>>
>> cpu I486_CPU
>> ident GENERIC
twa: Passthru request timed out! Resetting controller...
Has anyone experienced this:

twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20
twa0: INFO: (0x16: 0x1108): Resetting controller...:
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
...
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
twa0: INFO: (0x16: 0x1107): Controller reset done!:

This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on a number of machines with the following hardware configuration:

- Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
- 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives (configured as 8 SINGLE DISK units, aka JBOD)

All hardware components, including the server chassis, are listed in the 3ware hardware compatibility lists. It doesn't seem to be a cabling or power issue. The controller and hard drives are already flashed to the latest firmware revisions.

I tried turning off NCQ, but it didn't make any difference. I also tried switching the kernel from PAE to non-PAE (reducing the usable memory to 3GB), but it didn't help either.

I have other machines with similar I/O configurations (3ware), but with Intel motherboards and running FreeBSD-5.5, and these have been running fine for about a year. Now I'm thinking about swapping the drives between a working Intel box and an AMD box, to see where the controller timeouts follow.

The problem happens sporadically, about once a month, and is very hard to reproduce. Sometimes it takes several weeks until the next crash happens, sometimes it crashes again in just a few hours.

When it happens, the kernel sometimes panics (most likely due to the inconsistent filesystem state caused by the controller reset) and sometimes just hangs. It can be interrupted (I have a serial console), but the only usable thing after that seems to be to call cpu_reset(), followed by a full (and sometimes painfully long) filesystem check.

Here are the diffs against the default GENERIC and PAE kernel configurations:

cpu I486_CPU
ident GENERIC
options INET6 # IPv6 communications protocols
options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
options QUOTA
options SMP # Symmetric MultiProcessor Kernel
options BREAK_TO_DEBUGGER
options DDB
options KDB
options KDB_UNATTENDED
options IPFIREWALL
options DUMMYNET

I'm attaching the dmesg.boot following the latest crash.

Regards,
Atanas

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-PRERELEASE #0: Mon Nov 13 17:47:40 PST 2006
    [EMAIL PROTECTED]:/var/obj/usr/src/sys/XYZ-PAE
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Dual Core AMD Opteron(tm) Processor 270 (2009.27-MHz 686-class CPU)
  Origin = AuthenticAMD  Id = 0x20f12  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x1<SSE3>
  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow+,3DNow>
  AMD Features2=0x3<LAHF,CMP>
  Cores per package: 2
real memory = 5368709120 (5120 MB)
avail memory = 4182241280 (3988 MB)
ACPI APIC Table: <PTLTD APIC>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 2
 cpu3 (AP): APIC ID: 3
ioapic0 <Version 1.1> irqs 0-23 on motherboard
ioapic1 <Version 1.1> irqs 24-27 on motherboard
ioapic2 <Version 1.1> irqs 28-31 on motherboard
kbd1 at kbdmux0
acpi0: <PTLTD RSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory> at device 0.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
pci0: <serial bus, USB> at device 2.0 (no driver attached)
atapci0: <nVidia nForce CK804 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x1400-0x140f at device 6.0 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
pcib1: <ACPI PCI-PCI bridge> at device 9.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <display, VGA> at device 6.0 (no driver attached)
fxp0: <Intel 82551 Pro/100 Ethernet> port 0x2400-0x243f mem 0xda101000-0xda101fff,0xda12-0xda13 irq 16 at device 8.0 on pci1
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00
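
Since the resets are sporadic, it can help to tally how often the timeout and reset messages quoted in this thread show up in a saved log. What follows is only a minimal sketch, assuming the kernel messages have been saved to a plain text file. The function name summarize_twa_log and the file-scanning approach are illustrative assumptions, not part of the original report; only the message texts come from the output above.

```sh
#!/bin/sh
# Hypothetical helper (not from the original report): count twa(4) timeout
# and reset messages in a saved log file.

summarize_twa_log() {
    log=$1   # path to a text file with kernel messages (assumed layout)

    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    timeouts=$(grep -c 'twa[0-9]*: ERROR: .*Passthru request timed out' "$log" || true)
    resets=$(grep -c 'twa[0-9]*: INFO: .*Controller reset occurred' "$log" || true)

    echo "timeouts=${timeouts} resets=${resets}"
}

# Example call (the path is an assumption):
#   summarize_twa_log /var/log/messages
```

Comparing the two counters over time on an Intel box and an AMD box would be one way to track where the timeouts follow after swapping components.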
Re: twa: Passthru request timed out! Resetting controller...
adam radford said the following on 11/14/06 11:56 AM:

> Are you running the latest 3ware firmware on that controller?

Yep. It's in dmesg.boot:

twa0: INFO: (0x15: 0x1300): Controller details:: Model 9550SX-8LP, 8 ports, Firmware FE9X 3.04.01.011, BIOS BE9X 3.04.00.002

That's the latest one, released as 9.3.0.7 on the 3ware website. Yesterday I flashed and rebooted them all, and this morning I got the next crash.

Regards,
Atanas

> On 11/14/06, Atanas [EMAIL PROTECTED] wrote:
>> Has anyone experienced this:
>>
>> twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20
>> twa0: INFO: (0x16: 0x1108): Resetting controller...:
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
>> ...
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
>> twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
>> twa0: INFO: (0x16: 0x1107): Controller reset done!:
>>
>> This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on a number of machines with the following hardware configuration:
>>
>> - Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
>> - 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives (configured as 8 SINGLE DISK units, aka JBOD)
>>
>> All hardware components, including the server chassis, are listed in the 3ware hardware compatibility lists. It doesn't seem to be a cabling or power issue. The controller and hard drives are already flashed to the latest firmware revisions.
>>
>> I tried turning off NCQ, but it didn't make any difference. I also tried switching the kernel from PAE to non-PAE (reducing the usable memory to 3GB), but it didn't help either.
>>
>> I have other machines with similar I/O configurations (3ware), but with Intel motherboards and running FreeBSD-5.5, and these have been running fine for about a year. Now I'm thinking about swapping the drives between a working Intel box and an AMD box, to see where the controller timeouts follow.
>>
>> The problem happens sporadically, about once a month, and is very hard to reproduce.
Re: QUOTA+SNAPSHOT
Alexey Karagodov said the following on 8/14/06 10:17 AM:

> hi everybody!
> is the QUOTA+SNAPSHOT problem solved? any work-around?
>
> i also tried to run
> dd if=/dev/zero of=/var/swap/swap.bin bs=1m count=32768
> and the server stopped serving any requests except ping and scroll-lock on the console after 18983936K of data had been written.
>
> my dmesg.boot:
> Copyright (c) 1992-2006 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>     The Regents of the University of California. All rights reserved.
> FreeBSD 6.1-RELEASE-p3 #1: Tue Jul 18 15:02:07 MSD 2006

Not in 6.1-RELEASE AFAIK. I believe there are many fixes in 6-STABLE, but I haven't had a chance to test the snapshots so far.

I can say that for me 6-STABLE is no less stable than 6.1-RELEASE-p3. I had 3 production machines (dual dual-core Opterons) running 6.1-R/i386+PAE and crashing once a week or so with QUOTAS enabled and SNAPSHOTS disabled (i.e. background_fsck=NO in rc.conf). Since upgrading to 6-STABLE I have had 10 days of uptime and no crashes so far.

Regards,
Atanas

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: portupgrade bug: -M no longer works after v2.1.0
Sergey Matveychuk said the following on 7/12/06 10:57 PM:

> Atanas wrote:
>> Please, don't get me wrong. I'm not asking for help or for a workaround. I'm actually trying to help identify a problem or regression. If this is not a bug, but a feature change, please have it documented.
>
> It was a bug. Fixed. Thanks.

Thank you!

Atanas
Re: portupgrade bug: -M no longer works after v2.1.0
Sergey Matveychuk said the following on 7/11/2006 10:08 PM:

> Atanas wrote:
>> Recent portupgrade versions no longer obey the -M command line switch, i.e. any optional arguments to be prepended to each make command.
>>
>> How to reproduce:
>> # portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
>
> Everything works fine. Use -m for getting what you want.

For www/apache13 the -m switch could give the same result as -M would, but I'm not sure whether it's just a coincidence. The -m switch was supposed to serve a different purpose:

-m
--make-args    Specify arguments to append to each make(1) command line.

-M
--make-env     Specify arguments to prepend to each make(1) command line.

I tried testing another port where I used both:

# portinstall -M 'WITH_SYSLOG_FACILITY=local5' -m '-DWITHOUT_IPV6' mail/courier-imap

With portupgrade-2.0.1_1,1 (the stock 6.1-RELEASE package) it worked. With portupgrade-2.1.3.2,2 it failed (ignoring the -M part, as with www/apache13 before).

Then I joined both in one -m switch:

# portinstall -m 'WITH_SYSLOG_FACILITY=local5 -DWITHOUT_IPV6' mail/courier-imap

and the latest portupgrade-2.1.3.2,2 handled it just fine.

So, like you suggested, the -m switch seems to cover the functionality that -M used to provide. I'm not sure, however, whether this prepend-to-append conversion would work for all ports. For the ones I use it appears to work, so I have no problem and will update my scripts to use -m only.

The no longer working (obsolete?) -M switch would need to be removed from the man page though.

Thanks,
Atanas
Re: portupgrade bug: -M no longer works after v2.1.0
Sergey Matveychuk said the following on 7/12/06 3:24 AM:

> Atanas wrote:
>> Sergey Matveychuk said the following on 7/11/2006 10:08 PM:
>>> Atanas wrote:
>>>> Recent portupgrade versions no longer obey the -M command line switch, i.e. any optional arguments to be prepended to each make command.
>>>>
>>>> How to reproduce:
>>>> # portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
>>>
>>> Everything works fine. Use -m for getting what you want.
>>
>> For www/apache13 the -m switch could give the same result as -M would, but I'm not sure whether it's just a coincidence. The -m switch was supposed to serve a different purpose:
>>
>> -m
>> --make-args    Specify arguments to append to each make(1) command line.
>>
>> -M
>> --make-env     Specify arguments to prepend to each make(1) command line.
>>
>> I tried testing another port where I used both:
>>
>> # portinstall -M 'WITH_SYSLOG_FACILITY=local5' -m '-DWITHOUT_IPV6' mail/courier-imap
>>
>> With portupgrade-2.0.1_1,1 (the stock 6.1-RELEASE package) it worked. With portupgrade-2.1.3.2,2 it failed (ignoring the -M part, as with www/apache13 before).
>>
>> Then I joined both in one -m switch:
>>
>> # portinstall -m 'WITH_SYSLOG_FACILITY=local5 -DWITHOUT_IPV6' mail/courier-imap
>>
>> and the latest portupgrade-2.1.3.2,2 handled it just fine.
>>
>> So, like you suggested, the -m switch seems to cover the functionality that -M used to provide. I'm not sure, however, whether this prepend-to-append conversion would work for all ports. For the ones I use it appears to work, so I have no problem and will update my scripts to use -m only.
>>
>> The no longer working (obsolete?) -M switch would need to be removed from the man page though.
>
> Both -m and -M work fine but do different things. -m passes its argument as make argument(s) and -M passes its argument as environment variable(s). You can't set a make variable with an environment variable. They are different!

Yes, I know that. That's why I quoted the man page in my previous post (see above).

> -M has never worked as you expected.

I expected -M to work exactly as documented, as it has for years.

> You can test it with a command:
> % cd /usr/ports/www/apache13
> % env APACHE_HARD_SERVER_LIMIT=1024 make

Sure, the port build works:

% cd /usr/ports/www/apache13
% env APACHE_HARD_SERVER_LIMIT=1024 make
===> src/os/unix
cc -c -I../../os/unix -I../../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=1024 `../../apaci` os.c

while portupgrade -M fails:

% portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
===> src/os/unix
cc -c -I../../os/unix -I../../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 `../../apaci` os.c

> as opposed to
> % make APACHE_HARD_SERVER_LIMIT=1024

This doesn't make much sense to me. Perhaps you meant:

% make -DAPACHE_HARD_SERVER_LIMIT=1024

But I wouldn't expect this to succeed either:

===> src/os/unix
cc -c -I../../os/unix -I../../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 `../../apaci` os.c

Your suggestion however (to replace -M with -m) surprisingly worked:

cc -c -I../../os/unix -I../../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=1024 `../../apaci` os.c

And this is what's confusing.

> I think you confuse the two variable types.

No, I think I know what these are for. But trying your suggestion to replace -M with -m and finding that it works (for some ports?) just threw more fog on the case.

Let me say it clearly again: I have found that all recent versions of portupgrade (2.1.0+) fail to obey the -M switch and ignore any optional port parameters (i.e. arguments to prepend to each make command line) supplied there.

Please, don't get me wrong. I'm not asking for help or for a workaround. I'm actually trying to help identify a problem or regression. If this is not a bug, but a feature change, please have it documented. What the portupgrade(1) man page says about the -M switch is incorrect, as it no longer prepends the specified arguments to each make(1) command line.

Regards,
Atanas
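
The difference between the two switches, as the quoted man page describes it, can be illustrated with a few lines of shell. This is a minimal sketch of the documented semantics only, not portupgrade's actual implementation. The function name build_make_cmd is hypothetical, and mapping "prepend" to an env prefix follows the `env VAR=value make` test suggested earlier in the thread.

```sh
#!/bin/sh
# Hypothetical illustration (not portupgrade code): build the make(1)
# command line that the documented -M / -m semantics would produce.

build_make_cmd() {
    make_env=$1    # what would be passed to -M (prepended to the command line)
    make_args=$2   # what would be passed to -m (appended to the command line)

    cmd="make"
    if [ -n "$make_env" ]; then
        cmd="env $make_env $cmd"
    fi
    if [ -n "$make_args" ]; then
        cmd="$cmd $make_args"
    fi
    echo "$cmd"
}

# The courier-imap example from the thread, as the man page describes it:
build_make_cmd 'WITH_SYSLOG_FACILITY=local5' '-DWITHOUT_IPV6'
```

Under this reading, a -M value that never reaches the command line (the reported regression) would leave only the `make -DWITHOUT_IPV6` part, which matches the observed build falling back to the port defaults.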
portupgrade bug: -M no longer works after v2.1.0
Recent portupgrade versions no longer obey the -M command line switch, i.e. any optional arguments to be prepended to each make command.

How to reproduce:

# portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
...
===> src/ap
cc -c -I../os/unix -I../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 `../apaci` ap_cpystrn.c
...

Note the -DHARD_SERVER_LIMIT=512 above.

The stock version shipped with 6.1-RELEASE (2.0.1_1) seems to work as expected. I tried all CVS versions after that (with portdowngrade) and found that the breakage happened somewhere between 2.1.0 and 2.1.1 (2006/06/02 - 2006/06/04).

Regards,
Atanas
Re: portupgrade bug: -M no longer works after v2.1.0
Matthias Andree said the following on 7/11/06 1:48 PM:

> Atanas [EMAIL PROTECTED] writes:
>> Recent portupgrade versions no longer obey the -M command line switch, i.e. any optional arguments to be prepended to each make command.
>>
>> How to reproduce:
>> # portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
>> ...
>> ===> src/ap
>> cc -c -I../os/unix -I../include -I/usr/local/include -funsigned-char -O2 -fno-strict-aliasing -pipe -DDOCUMENT_LOCATION=\"/usr/local/www/data\" -DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 `../apaci` ap_cpystrn.c
>> ...
>>
>> Note the -DHARD_SERVER_LIMIT=512 above.
>
> Does it work if you type (you can omit the env in /bin/sh, bash, (pd)ksh and other Bourne-like shells):
>
> env APACHE_HARD_SERVER_LIMIT=1024 portinstall www/apache13

Of course it would, but this just bypasses the problem. There are other ways to work around it as well, like not using portupgrade at all and building everything with make.

The problem is that a bug introduced in one of the recent portupgrade versions changes its documented behavior. The '-M' switch in particular no longer works, so any existing port/package installation scripts that depend on it build packages with incorrect optional parameters.

It's not a problem with a particular port. The www/apache13 port was given just as an example of how to reproduce the bug. This affects _all_ ports when they are installed, upgraded or built via portupgrade with the '-M' switch.

> (Isn't it time to migrate to a newer Apache version anyways? 8-) )

(This is a long subject and kind of off-topic here. My short answer is no, or not yet. In some environments there are still legitimate reasons to use 1.3.)

Regards,
Atanas
Re: em device hangs on ifconfig alias ...
Pyun YongHyeon said the following on 7/7/06 8:32 PM:

> On Fri, Jul 07, 2006 at 10:38:01PM +0100, Robert Watson wrote:
>> Yes -- basically, there are two problems:
>>
>> (1) A little problem, in which an arp announcement is sent before the link has settled after reset.
>>
>> (2) A big problem, in which the interface is gratuitously reset, requiring long settling times.
>>
>> I'd really like to see a fix to the second of these problems (not resetting when an IP is added or removed, resulting in link renegotiation); the first one I'm less concerned about, although it would make some amount of sense to do an arp announcement when the link goes up.
>
> Ah, I see. Thanks for the insight.
> How about the attached patch?

This patch seems to fix both of the issues, or at least this is what I see now:

- the card no longer gets reset when adding an alias;
- the arp packet gets delivered;
- adding 250 aliases takes less than a second.

I haven't fully tested whether all 250 IP aliases were accessible (I used non-routable IP addresses), but I suppose so. I also couldn't stress the patched driver enough to see whether it performs as expected, but overall it looks good. I guess some more testing might be needed before the patch can be merged into the source tree.
Regards, Atanas Index: if_em.c === RCS file: /pool/ncvs/src/sys/dev/em/if_em.c,v retrieving revision 1.116 diff -u -r1.116 if_em.c --- if_em.c 6 Jun 2006 08:03:49 - 1.116 +++ if_em.c 8 Jul 2006 03:30:36 - @@ -67,6 +67,7 @@ #include netinet/in_systm.h #include netinet/in.h +#include netinet/if_ether.h #include netinet/ip.h #include netinet/tcp.h #include netinet/udp.h @@ -692,6 +693,9 @@ EM_LOCK_ASSERT(sc); + if ((ifp-if_drv_flags (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) != + IFF_DRV_RUNNING) + return; if (!sc-link_active) return; @@ -745,6 +749,7 @@ { struct em_softc *sc = ifp-if_softc; struct ifreq *ifr = (struct ifreq *)data; + struct ifaddr *ifa = (struct ifaddr *)data; int error = 0; if (sc-in_detach) @@ -752,9 +757,22 @@ switch (command) { case SIOCSIFADDR: - case SIOCGIFADDR: - IOCTL_DEBUGOUT(ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)); - ether_ioctl(ifp, command, data); + if (ifa-ifa_addr-sa_family == AF_INET) { + /* +* XXX +* Since resetting hardware takes a very long time +* we only initialize the hardware only when it is +* absolutely required. 
+*/ + ifp-if_flags |= IFF_UP; + if (!(ifp-if_drv_flags IFF_DRV_RUNNING)) { + EM_LOCK(sc); + em_init_locked(sc); + EM_UNLOCK(sc); + } + arp_ifinit(ifp, ifa); + } else + error = ether_ioctl(ifp, command, data); break; case SIOCSIFMTU: { @@ -802,17 +820,19 @@ IOCTL_DEBUGOUT(ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)); EM_LOCK(sc); if (ifp-if_flags IFF_UP) { - if (!(ifp-if_drv_flags IFF_DRV_RUNNING)) { + if ((ifp-if_drv_flags IFF_DRV_RUNNING)) { + if ((ifp-if_flags ^ sc-if_flags) + IFF_PROMISC) { + em_disable_promisc(sc); + em_set_promisc(sc); + } + } else em_init_locked(sc); - } - - em_disable_promisc(sc); - em_set_promisc(sc); } else { - if (ifp-if_drv_flags IFF_DRV_RUNNING) { + if (ifp-if_drv_flags IFF_DRV_RUNNING) em_stop(sc); - } } + sc-if_flags = ifp-if_flags; EM_UNLOCK(sc); break; case SIOCADDMULTI: @@ -878,8 +898,8 @@ break; } default: - IOCTL_DEBUGOUT1(ioctl received: UNKNOWN (0x%x), (int)command); - error = EINVAL; + error = ether_ioctl(ifp, command, data); + break; } return (error); Index: if_em.h === RCS file: /pool/ncvs/src/sys/dev/em/if_em.h,v retrieving revision 1.44 diff -u -r1.44 if_em.h --- if_em.h 15 Feb 2006 08:39:50
Re: em device hangs on ifconfig alias ...
Robert Watson said the following on 7/7/06 7:17 AM:

>>> I just left a tcpdump -n arp host 10.10.64.40 on a third machine sniffing around and tested all em module versions I had (the stock 6.1, 6-STABLE and 6-STABLE with your patch), but got silence on all three:
>>
>> That's odd. I've tested it on CURRENT and I could see the ARP packet. Are you sure you patched correctly? If so I have to build a RELENG_6 machine and give it a try.
>
> Is it possible you're seeing an interaction between the reset generated as part of IP address changing, and the time it takes to negotiate link? It's possible that the arp packets are being eaten during the link negotiation, so for systems negotiating quickly (or not at all) the arp packet is seen on other hosts, and otherwise not...

Looks like this is exactly what happens. I was able to see it by running two tcpdump instances: one on the EM machine, running in the background, and another running elsewhere on the same subnet.

So on the EM machine the arp packet actually gets generated by em(4) and caught by the tcpdump running there:

EM# tcpdump -n arp and ether src 00:04:23:b5:1b:ff &
EM#
EM# ifconfig em1 inet alias 10.10.64.40
EM# 11:28:37.178946 arp who-has 10.10.64.40 tell 10.10.64.40
EM#

But it doesn't reach the other tcpdump instance running on another host. It seems that the arp packet gets killed before leaving the EM machine, due to the card initialization or something else.

I tried sending it manually with arping, just to make sure both tcpdumps operate properly, and yes, the packet got delivered to both.

I think that I have patched, built and loaded the em(4) kernel module correctly. After applying the patch there were no rejects. Before building the module I intentionally appended "(patched)" to its version string in if_em.c, and could see that in dmesg every time I loaded the module:

em1: <Intel(R) PRO/1000 Network Connection Version - 3.2.18 (patched)>

Regards,
Atanas
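
The manual workaround mentioned in this thread (bring up the alias, then announce it with arping from the aliased address) can be written down as a small dry-run helper. This is only a sketch: the function name print_alias_cmds is hypothetical, and the helper merely prints the two commands, using the same ifconfig and arping forms that appear in the thread, so that nothing is changed on the machine.

```sh
#!/bin/sh
# Hypothetical dry-run helper (not from the thread): print the commands for
# adding an IP alias and announcing it via arping, without executing them.

print_alias_cmds() {
    iface=$1   # interface name, e.g. em1
    ip=$2      # alias address to add
    peer=$3    # any host on the subnet to arping (placeholder name)

    echo "ifconfig $iface inet alias $ip netmask 255.255.255.255"
    echo "arping -S$ip -c1 $peer"
}

# Example with the addresses used in the thread:
print_alias_cmds em1 10.10.64.40 somehost
```

Printing instead of executing keeps the sketch safe to run anywhere; piping the output to sh as root would be the obvious next step on a test machine.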
Re: em device hangs on ifconfig alias ...
Pyun YongHyeon said the following on 7/5/06 7:14 PM: Here is patch generated against RELENG_6. OK, I just tested that, but it doesn't seem to make any difference. Here's what I did: I commented out the em device from my kernel (a 6-STABLE one from yesterday) and compiled three if_em kernel modules: - one taken from 6.1 release - the unpatched 6-STABLE one - the latter with the above patch applied So I was able to load and test each of these modules independently and without actually restarting the machine. I changed also the driver version string in if_em.c, just to ensure that I'm really loading the right em module by checking dmesg: em1: Intel(R) PRO/1000 Network Connection Version - 3.2.18 (patched) port 0xdc80-0xdcbf mem 0xfcfe-0xfcff irq 55 at device 4.1 on pci3 em1: Ethernet address: 00:04:23:b5:1b:ff em1: link state changed to UP I used 2 machines - one running 6.1-RELEASE and using fxp (I'll call it FXP), and the test one running 6-STABLE with em (I'll call it EM), and tried exchanging/moving an IP alias between them. FXP# ifconfig fxp0: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST mtu 1500 options=bRXCSUM,TXCSUM,VLAN_MTU inet 10.10.64.30 netmask 0xff00 broadcast 10.10.64.255 ether 00:e0:81:31:f4:1e media: Ethernet autoselect (100baseTX full-duplex) status: active EM# ifconfig em1: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST mtu 1500 options=bRXCSUM,TXCSUM,VLAN_MTU inet 10.10.64.63 netmask 0xff00 broadcast 10.10.64.255 ether 00:04:23:b5:1b:ff media: Ethernet autoselect (100baseTX full-duplex) status: active First I brought up an IP alias on the FXP machine: FXP# ifconfig fxp0 inet alias 10.10.64.40 netmask 255.255.255.255 and checked whether it's accessible from anywhere - yes. Then I moved that to EM: FXP# ifconfig fxp0 inet -alias 10.10.64.40 EM# ifconfig em1 inet alias 10.10.64.40 netmask 255.255.255.255 and checked again - no. It was accessible only from its own subnet (10.10.64.x), but not from anywhere else. 
Moving that back to FXP works, but moving it back to EM doesn't. The only way I found to make it accessible was to arping something from the aliased IP address:

EM# arping -S10.10.64.40 -c1 somehost

So it seems that when an IP alias has been recently used on some other machine (on FXP in my case), the em driver is unable to initialize that IP alias properly. It might be that the fxp driver is not sending something when releasing an alias, who knows. But the fact is that fxp always initializes its aliases properly - I use it extensively and it has always worked.

I tried setting another IP alias that has never been used on these machines. I brought that up first on EM and it worked. Then I moved it to FXP and it also worked! But moving it back to EM made it inaccessible. It looks like there's something fishy with the alias initialization.

Another related problem is that the card gets re-initialized (reset?) on each alias you add (takes between 0.3 and 1 seconds, depending on how fast the hardware is), which for mass-aliased systems could be a serious hurdle after a crash or reboot.
Regards,
Atanas

--- if_em.c.orig Fri May 19 09:19:57 2006
+++ if_em.c Thu Jul 6 11:10:56 2006
@@ -657,8 +657,9 @@
     mtx_assert(&adapter->mtx, MA_OWNED);

-    if (!adapter->link_active)
-        return;
+    if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+        IFF_DRV_RUNNING)
+        return;

     while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {

@@ -719,11 +720,6 @@
     if (adapter->in_detach) return(error);

     switch (command) {
-    case SIOCSIFADDR:
-    case SIOCGIFADDR:
-        IOCTL_DEBUGOUT("ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)");
-        ether_ioctl(ifp, command, data);
-        break;
     case SIOCSIFMTU:
         {
         int max_frame_size;
@@ -760,16 +756,17 @@
         IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)");
         EM_LOCK(adapter);
         if (ifp->if_flags & IFF_UP) {
-            if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+            if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+                if ((ifp->if_flags ^ adapter->if_flags) &
+                    IFF_PROMISC) {
+                    em_disable_promisc(adapter);
+                    em_set_promisc(adapter);
+                }
+            } else
                 em_init_locked(adapter);
-            }
-
-            em_disable_promisc(adapter);
-            em_set_promisc(adapter);
         } else {
-            if (ifp->if_drv_flags & IFF_DRV_RUNNING
Re: em device hangs on ifconfig alias ...
Pyun YongHyeon said the following on 7/6/06 6:03 PM:

> Hmm, that's strange. I've double checked that stock em(4) didn't generate ARP packets when its addresses were changed. So I made em(4) generate ARP. Could you see a gratuitous ARP with tcpdump when you change its address?

I just left a "tcpdump -n arp host 10.10.64.40" on a third machine sniffing around and tested all em module versions I had (the stock 6.1, 6-STABLE and 6-STABLE with your patch), but got silence on all three:

EM# ifconfig em1 inet alias 10.10.64.40
<nothing>
EM# ifconfig em1 inet -alias 10.10.64.40
<nothing>

The fxp driver appears to send something on startup and nothing on shutdown:

FXP# ifconfig fxp0 inet alias 10.10.64.40
18:41:54.584059 arp who-has 10.10.64.40 tell 10.10.64.40
FXP# ifconfig fxp0 inet -alias 10.10.64.40
<nothing>

When I manually arping the em alias after startup (i.e. simulate what fxp does), everything works as expected:

EM# ifconfig em1 inet alias 10.10.64.40
<nothing>
EM# arping -c1 -S10.10.64.40 10.10.64.40
18:46:07.808701 arp who-has 10.10.64.40 tell 10.10.64.40
EM# ifconfig em1 inet -alias 10.10.64.40
<nothing>

It appears that this is what the em driver is supposed to do, or at least fxp does it in this way.

> This is other issue. em(4) performs two time-consuming operations in its initialization routine. One is DMA tag/map creation and the other is checksumming EEPROM contents in init routine. I have an experimental patch for it but let's fix one at a time.

OK, let's put that aside for now.

Regards,
Atanas
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
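For readers unfamiliar with what the "arp who-has 10.10.64.40 tell 10.10.64.40" line in the capture represents, here is a minimal Python sketch that builds such a gratuitous ARP request as raw bytes. It only illustrates the frame layout; it is not how fxp, em or arping implement it, and the function name build_gratuitous_arp is made up for this example. The MAC and IP are the em1 values quoted in this thread.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    ip_raw = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + src_mac + struct.pack("!H", 0x0806)

    # ARP payload for Ethernet/IPv4, opcode 1 (request).
    # Sender IP and target IP are both the aliased address, which is
    # what tcpdump prints as "who-has X tell X".
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        src_mac,      # sender hardware address
        ip_raw,       # sender protocol address
        b"\x00" * 6,  # target hardware address (unknown)
        ip_raw,       # target protocol address
    )
    return eth_header + arp_payload


frame = build_gratuitous_arp("00:04:23:b5:1b:ff", "10.10.64.40")
print(len(frame), frame.hex())
```

Actually putting such a frame on the wire needs a raw socket or BPF device and root privileges, which this sketch deliberately leaves out; arping remains the practical tool.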
Re: em device hangs on ifconfig alias ...
Pyun YongHyeon said the following on 6/30/06 8:54 PM: On Fri, Jun 30, 2006 at 12:28:49PM -0700, Atanas wrote: User Freebsd said the following on 6/29/06 9:29 PM: The other funny thing about the current em driver is that if you move an IP to it from a different server, the appropriate ARP packets aren't sent out to redirect the IP traffic .. recently, someone pointed me to arping, which has solved my problem *external* to the driver ... That's the second reason why I (still) avoid em in mass-aliased systems. I have a single pool of IP addresses shared by many servers with multiple aliases each. When someone leaves and frees an IP, it gets reused and brought up on a different server. In case it was previously handled by em, the traffic doesn't get redirected to the new server. Similar thing happens even with machines with single static IPs. For instance when retiring an old production system, I usually request a new box to be brought up on a different IP, make a fresh install on everything and test, swap IP addresses and reboot. In case of em, after a soft reboot both systems are inaccessible. A workaround is to power both of the systems down and then power them up. This however cannot be done remotely and in case there were IP aliases, they still don't get any traffic. I haven't fully tested it but what about attached patch? It may fix your ARP issue. The patch also fixes other issues related with ioctls. Now em(4) will send a ARP packet when its IP address is changed even if there is no active link. Since em(4) is not mii-aware driver I can't sure this behaviour is correct. The patch is against if_em.c,v 1.116 2006/06/06, which is 7-CURRENT. I tried merging the relevant em driver files into a 6-STABLE installation by simply copying sys/dev/em/* and sys/modules/em/Makefile, but it seems that the new revision depends on other -CURRENT things and the module build fails: # pwd /usr/src/sys/modules/em # make clean; make ... 
/usr/src/sys/modules/em/../../dev/em/if_em.c: In function `em_setup_interface':
/usr/src/sys/modules/em/../../dev/em/if_em.c:2143: error: `IFCAP_VLAN_HWCSUM' undeclared (first use in this function)
...

I don't have a 7-CURRENT based box around. It seems too bleeding edge for me anyway. I was hoping to play with different if_em kernel modules on a semi-production (spare) box and eventually test the proposed em patch, but apparently it's not so easy. Please let me know if I'm missing something obvious.

Thanks,
Atanas

Index: if_em.c
===
RCS file: /pool/ncvs/src/sys/dev/em/if_em.c,v
retrieving revision 1.116
diff -u -r1.116 if_em.c
--- if_em.c 6 Jun 2006 08:03:49 - 1.116
+++ if_em.c 1 Jul 2006 03:51:41 -
@@ -692,7 +692,8 @@
     EM_LOCK_ASSERT(sc);

-    if (!sc->link_active)
+    if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+        IFF_DRV_RUNNING)
         return;

     while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
@@ -751,11 +752,6 @@
         return (error);

     switch (command) {
-    case SIOCSIFADDR:
-    case SIOCGIFADDR:
-        IOCTL_DEBUGOUT("ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)");
-        ether_ioctl(ifp, command, data);
-        break;
     case SIOCSIFMTU:
         {
         int max_frame_size;
@@ -802,17 +798,19 @@
         IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)");
         EM_LOCK(sc);
         if (ifp->if_flags & IFF_UP) {
-            if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+            if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+                if ((ifp->if_flags ^ sc->if_flags) &
+                    IFF_PROMISC) {
+                    em_disable_promisc(sc);
+                    em_set_promisc(sc);
+                }
+            } else
                 em_init_locked(sc);
-            }
-
-            em_disable_promisc(sc);
-            em_set_promisc(sc);
         } else {
-            if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
+            if (ifp->if_drv_flags & IFF_DRV_RUNNING)
                 em_stop(sc);
-            }
         }
+        sc->if_flags = ifp->if_flags;
         EM_UNLOCK(sc);
         break;
     case SIOCADDMULTI:
@@ -878,8 +876,8 @@
         break;
         }
     default:
-        IOCTL_DEBUGOUT1("ioctl received: UNKNOWN (0x%x)", (int)command);
-        error = EINVAL;
+        error = ether_ioctl(ifp, command, data);
+        break;
     }

     return (error);
Index: if_em.h
Re: em device hangs on ifconfig alias ...
Michael Vince said the following on 6/29/06 8:53 PM:

> The thing that have to ask is if Atanas has 100's why can't he just boot Freebsd have have them all prebound to the interface at startup, why would you need to add and remove them constantly by the hundreds during normal server uptime?

I wasn't talking about normal server uptime. Sooner or later, regardless of how perfect the hardware is and how well the OS performs, you will have to reboot, at least once or twice a year to update the kernel and/or world. Even on such rare occasions several minutes of additional downtime per reboot (in my case) are not justifiable.

I know that in a perfect world this downtime could be scheduled. But I prefer to keep the option to quickly reboot my systems when necessary. 2 vs 10-15 minutes of downtime per reboot really makes a difference.

How you bind the aliases doesn't really matter - you always end up waiting for the em driver to reset the card on each alias.

And don't get me wrong, I have used em for years on many machines having one static IP or a few additional static aliases and it works great. It just doesn't fit well in mass-alias configurations.

Regards,
Atanas
Re: em device hangs on ifconfig alias ...
User Freebsd said the following on 6/29/06 9:29 PM: The other funny thing about the current em driver is that if you move an IP to it from a different server, the appropriate ARP packets aren't sent out to redirect the IP traffic .. recently, someone pointed me to arping, which has solved my problem *external* to the driver ... That's the second reason why I (still) avoid em in mass-aliased systems. I have a single pool of IP addresses shared by many servers with multiple aliases each. When someone leaves and frees an IP, it gets reused and brought up on a different server. In case it was previously handled by em, the traffic doesn't get redirected to the new server. Similar thing happens even with machines with single static IPs. For instance when retiring an old production system, I usually request a new box to be brought up on a different IP, make a fresh install on everything and test, swap IP addresses and reboot. In case of em, after a soft reboot both systems are inaccessible. A workaround is to power both of the systems down and then power them up. This however cannot be done remotely and in case there were IP aliases, they still don't get any traffic. I have a third machine that uses an em driver, but its an older 4.x kernel, and it operates perfectly ... no timeouts/hangs and sends out the appropriate ARP packet ... all three servers are connected to the same Cisco switch, with all ports configured identically, so it isn't a switch issue, as someone else intimated ... This seems strange, could depend on the chip version, who knows. I still have many 4.x based machines, and both em issues (the card reset on each alias and the arp packets not been sent when going down) were present when I was doing my tests. I check for these once in a while (a year or so), usually with the latest major release branch. We had a compatibility issue about a year ago with a (rather exotic?) fiber NIC - 82545GM, where FreeBSD-4.x did better. 
The em driver coming with 5.x didn't support that (or wasn't working as expected, I don't remember the specifics), while the one coming with 4.x did, so we ended up installing 4.11 then. Regards, Atanas
Re: em device hangs on ifconfig alias ...
User Freebsd said the following on 6/30/06 1:48 PM:

> see 'arping' ... great little tool, solved all my problems as far as moving around IPs ...

Thanks for the tip, I will try it next time.

>> I still have many 4.x based machines, and both em issues (the card reset on each alias and the arp packets not been sent when going down) were present when I was doing my tests.
>
> Right, what version of 4.x? The one that I have working is from ~Feb 2005 .. if I were to upgrade that to the latest 4-STABLE, it would break like the rest ... the older 4.x had a different em driver in the kernel then the newer one ...

The problem was initially discovered back in 2003 (must have been with 4.8 or 4.9) and after switching back to fxp I haven't tested the 4.x branch any more. I remember testing 5.x around the 5.3 release (2004) and 6.x shortly after 6.0 (2005), and the em driver versions shipped with both releases had the same issues.

I haven't tested it this year yet, but even if it's fixed, it's not likely that all improvements will get back-ported to the older branches (over 90% of my servers run something older than 6.x). But as long as fxp works, this is a non-issue for me.

Regards,
Atanas
Parallel fsck in non-preen/full mode?
Is there some easy way to force a full (non-preen) and at the same time parallel (i.e. one process per disk) fsck? It could be a real downtime saver in crash recovery situations.

Imagine the following (fairly typical in my case) scenario: You have many machines with a bunch of drives each and many files on each drive. After a crash (due to a hardware failure or something else), the initial preen (fsck -p) fails. You have the following options:

a) rely on the background fsck available for 5.x and up;
b) set fsck_y_enable to YES to do fsck -y if the initial preen fails;
c) fsck it manually via local or serial console.

Background fsck relies on snapshots, which don't cope well with user quotas and often deadlock and cause more crashes. Actually the QUOTA + snapshots combination worked somewhat better in 5.x than in 6.x now. For 6.1 it's no longer an option for me.

An fsck -y is slow as hell as it doesn't run in parallel. For instance 6 72GB drives (each about 75% full with a million files) could take a good 2 hours, primarily because fsck assumes that interaction is required and runs the checks one at a time.

Manual fsck needs attention (additional downtime), and the fastest way to bring the machine back up is to do exactly what fsck -p would do, but in _full_ mode, i.e.:

# fsck -y da0s1a
# fsck -y da0s1d
# fsck -y da1s1d
...
# fsck -y da7s1d
# ps ax |grep fsck
# ...
# exit

The above takes just 15 minutes or so, plus the time between the moment when the crash actually happens and the moment you start typing on the console (which sometimes could be much more than 15 minutes).

This could be automated by putting something similar (plus perhaps some shell code taking device entries from /etc/fstab and a loop waiting for the fsck processes to finish) in /etc/rc.early or a separate rc.d/ style script. But such a hack I think would look somewhat ugly in shell and would just mimic what fsck already does in order to check multiple drives when running in preen mode.
It seems that it would be really helpful (and possibly harmless) if fsck could be forced to do checks in parallel when running with '-y', when console interaction is not needed anyway, or perhaps through a new switch (-Y?).

I could eventually try to modify the fsck source and somehow change the default '-y' behavior. But I wouldn't like to carry the additional luggage of custom patches on all servers, and I don't think that I am the most qualified person to do so.

So in case someone still reads this, please advise.

Regards,
Atanas
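As a rough illustration of the automation idea (take device entries from /etc/fstab, run one fsck -y worker per disk, wait for all of them), here is a minimal Python sketch. It is not what fsck -p does internally and it is not a tested rc script: the sample fstab, the function names and the rule for deriving the disk name from the device path are all assumptions made for this example, and it ignores pass-number ordering apart from skipping entries with pass 0.

```python
import re
import subprocess
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fstab contents, modelled on the device names used above.
SAMPLE_FSTAB = """\
# Device      Mountpoint  FStype  Options  Dump  Pass#
/dev/da0s1b   none        swap    sw       0     0
/dev/da0s1a   /           ufs     rw       1     1
/dev/da0s1d   /var        ufs     rw       2     2
/dev/da1s1d   /data1      ufs     rw       2     2
/dev/da7s1d   /data7      ufs     rw       2     2
"""


def group_by_disk(fstab_text):
    """Map each disk (e.g. 'da0') to its UFS devices with a non-zero pass."""
    groups = defaultdict(list)
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) < 6 or fields[2] != "ufs" or fields[5] == "0":
            continue
        match = re.match(r"/dev/([a-z]+\d+)", fields[0])
        if match:
            groups[match.group(1)].append(fields[0])
    return dict(groups)


def check_disk(devices, fsck="fsck"):
    """Check the file systems of one disk one after another."""
    return [subprocess.run([fsck, "-y", dev]).returncode for dev in devices]


def run_parallel(groups, fsck="fsck"):
    """Run one worker per disk and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {disk: pool.submit(check_disk, devs, fsck)
                   for disk, devs in groups.items()}
        return {disk: fut.result() for disk, fut in futures.items()}


if __name__ == "__main__":
    # Only show the grouping; run_parallel() would really invoke fsck.
    print(group_by_disk(SAMPLE_FSTAB))
```

Grouping by disk keeps two checks from competing for the same spindle, which is the same reason the manual procedure above starts one fsck per drive.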
Re: em device hangs on ifconfig alias ...
Dan Nelson said the following on 6/28/06 3:52 PM:

> In the last episode (Jun 28), User Freebsd said:
>> has anyone figured out why the em device 'hangs' for about 30-45 seconds whenever you ifconfig alias a new IP on to the device?
>
> The em driver resets the card when you add an IP to it, and unless you've configured your switch not to autodetect fancy features on that port, it may very well take 45 seconds for it to come up.

For me the em reset actually takes about a second or so per IP alias. But the more aliases you have, the longer the delay becomes. If you have hundreds (like I do), a single reboot might cost you something like 10-15 minutes of downtime, just for the aliases to come up.

That's the primary reason I stay away from the on-board 1Gbps em NICs that almost every Intel server board nowadays comes with. I simply disable them and use a good old (and cheap) Intel PRO/100 fxp compatible PCI NIC instead. It's fast enough and doesn't reset the card when you add an alias. The only downside is that it gives you 100Mbps at most.

Does anybody know a better NIC driver alternative when dealing with lots of IP aliases?

I have some newer machines with 2 Broadcom chips on-board. I plan to give them a try at some point in the future, but I'm not sure how stable the bge driver is when compared to fxp and em.

Regards,
Atanas
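To put the figures above side by side, a back-of-the-envelope Python sketch: with roughly one second of card reset per alias, the 10-15 minutes quoted correspond to something like 600-900 aliases. The alias counts below are hypothetical; only the "about a second per alias" figure comes from this message.

```python
def alias_bringup_seconds(alias_count, reset_seconds):
    """Total time spent in card resets while bringing up all aliases."""
    return alias_count * reset_seconds


for count in (300, 600, 900):  # hypothetical alias counts
    minutes = alias_bringup_seconds(count, 1.0) / 60
    print(f"{count} aliases at 1.0 s each: {minutes:.1f} minutes")
```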
Re: fsck_ufs locked in snaplk
Kris Kennaway said the following on 4/25/06 9:22 AM:

> On Tue, Apr 25, 2006 at 06:39:09PM +0300, Kostik Belousov wrote:
>> Obviously, revisions 1.78, 1.79 of the sys/ufs/ufs/ufs_quota.c shall be MFCed. Try this patch (note, I does not tested it):
>
> WTF, I could have sworn I merged that! Yes, this patch is needed. However, I don't think it's the cause of runtime deadlocks.

Thanks for the heads up! I was just about to release the next production box without checking that and assuming the QUOTA fix was already in place.

I would like to confirm that I have another fully loaded server running 6.1-PRERELEASE (BETA2 based on 6-STABLE) from Mar 1 with manually patched sys/ufs/ufs/ufs_quota.c (1.74.2.1 2006/01/14) with a similar diff generated from CURRENT between 1.77 and 1.80. 55 days uptime and no problems so far.

Regards,
Atanas

P.S. Forgot to CC the list, sorry for the double post.

Index: sys/ufs/ufs/ufs_quota.c
===
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_quota.c,v
retrieving revision 1.77
retrieving revision 1.79
diff -u -r1.77 -r1.79
--- sys/ufs/ufs/ufs_quota.c 9 Jan 2006 20:42:19 - 1.77
+++ sys/ufs/ufs/ufs_quota.c 12 Feb 2006 13:20:06 - 1.79
@@ -429,8 +429,9 @@
     quotaoff(td, mp, type);
     ump->um_qflags[type] |= QTF_OPENING;
     mp->mnt_flag |= MNT_QUOTA;
-    ASSERT_VOP_LOCKED(vp, "quotaon");
+    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
     vp->v_vflag |= VV_SYSTEM;
+    VOP_UNLOCK(vp, 0, td);
     *vpp = vp;
     /*
      * Save the credential of the process that turned on quotas.
@@ -535,8 +536,9 @@
     }
     MNT_IUNLOCK(mp);
     dqflush(qvp);
-    ASSERT_VOP_LOCKED(qvp, "quotaoff");
+    vn_lock(qvp, LK_EXCLUSIVE | LK_RETRY, td);
     qvp->v_vflag &= ~VV_SYSTEM;
+    VOP_UNLOCK(qvp, 0, td);
     error = vn_close(qvp, FREAD|FWRITE, td->td_ucred, td);
     ump->um_quotas[type] = NULLVP;
     crfree(ump->um_cred[type]);
Re: FreeBSD/i386 6-stable + 4 GB RAM
Oliver Fromme said the following on 03/14/06 08:30:

> Will FreeBSD/i386 6-stable run on a 4 GB machine out of the box? Do I have to apply special tuning (kernel config or sysctl or whatever)? Using PAE shouldn't be necessary, I assume.

It all depends on how much memory address space the motherboard manufacturer decided to reserve for PCI devices. I've seen boards with PCI window sizes ranging from 256 to 1024MB. In order to utilize the full amount of RAM you would need PAE.

> It would also be interesting to know if 4-stable (which is currently running on the predecessor machines) would run without problems on those new 4 GB ones, too.

For 4-STABLE you would need to adjust KVA_PAGES. Otherwise, depending on the load and the memory usage, you might get random crashes. Or at least this is what I experienced when upgrading RAM on a bunch of 4.x based machines.

Regards,
Atanas
Re: well-supported SATA RAID card?
Brian Szymanski said the following on 03/10/06 01:01:

> Howdy... After not having much success with the hptmv driver for highpoint's rocketraid 1820A, I'm wondering if other folks have had good luck with any SATA RAID cards with at least 6 ports... Is there a SATA RAID card with utilities that let you manage while the OS is running that folks have had good luck with? I've been happy with the megaraid series on linux at my job, but I'm wondering if the management utilities are there on freebsd, etc.

3ware.

Regards,
Atanas
hw.realmem on i386
I'm setting up 6.1-BETA2/i386 on an AMD-based (dual Opteron 270) Tyan K8SE S2892 motherboard with 4GB RAM. The PCI memory address range on this board takes an entire gigabyte, leaving only 3GB of usable memory in i386 mode. The remaining part gets remapped (by the BIOS) above the 4GB limit.

On the Intel server boards the PCI range used to take only 256MB or 512MB, so I could afford to ignore that. But 1GB now seems too much, so I decided to compile a PAE enabled kernel and see what happens.

The PAE enabled kernel detects the full amount of RAM, boots normally and seems all right so far (not in production yet):

# dmesg |grep memory
real memory = 5368709120 (5120 MB)
avail memory = 4182597632 (3988 MB)
# memcontrol list
...
0x0/0x8000 BIOS write-back set-by-firmware active
0x8000/0x4000 BIOS write-back set-by-firmware active
0x1/0x4000 BIOS write-back set-by-firmware active

The thing that puzzles me is the sysctl hw.realmem value:

# sysctl -a |grep hw.*mem:
hw.physmem: 4286291968
hw.usermem: 4106076160
hw.realmem: 1073741824

Wasn't this supposed to be greater than both hw.physmem and hw.usermem? Or at least this is what I see on all other (non-PAE) boxes I have: realmem > physmem > usermem

Here are a few examples:

Intel SE7501WV2, 4GB (-256MB), non-PAE i386:
hw.physmem: 4017508352
hw.usermem: 3792785408
hw.realmem: 4026466304

Intel SE7520JR2, 4GB (-512MB), non-PAE i386:
hw.physmem: 3749007360
hw.usermem: 3285360640
hw.realmem: 3757965312

AMD-based, 4GB, amd64:
hw.physmem: 4218327040
hw.usermem: 399648
hw.realmem: 4227792896

I'm wondering what impact such a supposedly incorrect hw.realmem < hw.physmem value could have, and whether the kernel options would need to be tweaked manually in order to fix that.

I remember a case when I had to upgrade a 4.x based box from 2GB to 4GB, so vm.kvm_free became larger than vm.kvm_size, resulting in random crashes (until I realized that I had to manually adjust the KVA_PAGES kernel option), but that does not seem very relevant here.

I'm running 6.1-PRERELEASE (6-STABLE) from Feb 22 2005 with the following mods to the kernel configuration files:

GENERIC:
cpu I486_CPU
options INET6 # IPv6 communications protocols
options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
options QUOTA
options SMP # Symmetric MultiProcessor Kernel

PAE:
options IPFIREWALL
options DUMMYNET

Regards,
Atanas
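One arithmetic observation about the numbers in this message, offered as a guess rather than a diagnosis: the reported hw.realmem is exactly what remains of the 5120 MB "real memory" figure when it is cut down to 32 bits. Whether the sysctl really is stored or exported through a 32-bit integer on i386 is an assumption that nothing in this thread confirms. A short Python check of the arithmetic:

```python
REAL_MEMORY = 5368709120       # "real memory" line from dmesg
REPORTED_REALMEM = 1073741824  # hw.realmem from sysctl


def truncate_to_32_bits(value):
    """Keep only the low 32 bits, as an unsigned 32-bit variable would."""
    return value & 0xFFFFFFFF


print(truncate_to_32_bits(REAL_MEMORY))  # 1073741824
print(truncate_to_32_bits(REAL_MEMORY) == REPORTED_REALMEM)
```

If that guess is right, the odd value would be a reporting artifact and not a sign that the kernel is using less memory, which is consistent with hw.physmem showing nearly 4GB.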
Re: AMD64 or I386
Albert Shih said the following on 02/22/06 16:59:

> Hi all I've very strange problem with my new servers with AMD single core dual proc with 4 Go Ram. When I boot the i386 version of FreeBSD 6.0 he see 4 Go but tell me he can't not access to 4go but only 3 Go. I upgrade to FreeBSD 6-Stable and nothing change. When I boot the amd64 version of FreeBSD 6.0 he see 5 Go (!!) but tell me he can access only 4 go (well). When I upgrade to FreeBSD 6-Stable nothing change. My problem is I need i386 version (because I'need maxima who need sbcl...and sbcl don't run on amd64). If I tell the kernel I've 4 Go the system don't boot (kernel panic). What can I do.

You might need to compile a PAE enabled kernel, see the pae(4) man page for more details.

Regards,
Atanas
Re: SSH login takes very long time...sometimes
Carl Makin said the following on 02/16/06 20:07:

> Atanas wrote:
>> Does anybody know whether ipfw (or something else within FreeBSD-4) is capable of setting connection rate limits?
>
> I'm using SEC to monitor the auth.log file and block any IP addresses that fail a password 3 times within 60 seconds. I use the following sec.conf file;

Yeah, it does pretty much the same thing I do with a simple script like:

#!/usr/bin/perl
use strict;

my $MAX_TRIES = 5;
my $RULE_BASE = 10100;
my $RULES_MAX = 10;
my $Rule = $RULE_BASE;
my %Match;

sub ip_block # ($ip, $port)
{
    my ($ip, $port) = @_;
    # Reuse the rule slot if it is already taken.
    `ipfw delete $Rule` if `ipfw list $Rule 2>/dev/null`;
    `ipfw add $Rule deny tcp from $ip to any $port in setup`;
    # Advance to the next slot, wrapping around after $RULES_MAX rules.
    $Rule = $RULE_BASE + (++$Rule - $RULE_BASE) % $RULES_MAX;
}

open LOG, "tail -f /var/log/auth.log |";
while (<LOG>) {
    if( /sshd\[\d+\]/ ) {
        if( /((Illegal user|Failed password for) \S+|Did not receive identification string) from (\d+\.\d+\.\d+\.\d+)/ ) {
            my $ip = $3;
            next if $Match{$ip}++ < $MAX_TRIES;
            ip_block($ip,22);
            undef $Match{$ip};
        }
    }
}
close LOG;

And a cron job removes the blocks every hour:

7 * * * * /sbin/ipfw delete 10100 10101 10102 10103 10104 10105 10106 10107 10108 10109

It does the job, but it would be nice for sshd to have some rate-limit protection built in. Otherwise, with the increasing number of attacks nowadays, many people would need similar protection.

Regards,
Atanas
Re: SSH login takes very long time...sometimes
Marian Hettwer said the following on 02/17/06 00:39:

> Atanas wrote:
>> Last year I already had to decrease the LoginGraceTime from 120 to 30 seconds on my production boxes, but it didn't help much, so on top of that I got to implement (reinvent the wheel again) a script tailing the auth.log and firewalling bad guys in order to secure sshd and let my legitimate users in.
>
> You could get rid of parsing auth.log and everything and just use pf(4) instead. Look at that:
>
> # sshspammer table
> table <sshspammer> persist
> block log quick from <sshspammer>
> # sshspammer
> # more than 6 ssh attempts in 15 seconds will be blocked ;)
> pass in quick on $ext_if proto tcp to ($ext_if) port ssh $tcp_flags (max-src-conn 10, max-src-conn-rate 6/15, overload <sshspammer> flush global)

Thanks for the suggestion! The pf in 5.x/6.x base and especially its rate-limit capability seems to be a good reason to upgrade my existing 4.x based boxes before RELENG_4's EoL.

Regards,
Atanas
Re: SSH login takes very long time...sometimes
Mike Tancsa said the following on 02/17/06 11:50:

> At 09:17 PM 16/02/2006, Atanas wrote:
>> Does anybody know whether ipfw (or something else within FreeBSD-4) is capable of setting connection rate limits?
>
> Why not just launch sshd out of inetd ?

Primarily because of the big scare sign in the sshd man page:

-i  Specifies that sshd is being run from inetd(8). sshd is normally
    not run from inetd because it needs to generate the server key
    before it can respond to the client, and this may take tens of
                                                           ^^^^^^^
    seconds. Clients would have to wait too long if the key was
    ^^^^^^^
    regenerated every time. However, with small key sizes (e.g.,
    512) using sshd from inetd may be feasible.

It was my fault for not verifying how much time it really takes. I just tested it on a couple of machines, and it seems to be way faster:

# time ssh [EMAIL PROTECTED]
real 0m0.669s
user 0m0.012s
sys  0m0.000s

# time ssh [EMAIL PROTECTED]
real 0m0.374s
user 0m0.000s
sys  0m0.008s

# time ssh [EMAIL PROTECTED]
real 0m0.348s
user 0m0.000s
sys  0m0.008s

I ran this multiple times. The first one defaults to a 2048-bit key (a 6-STABLE based box), the second one to a 1024-bit key (5.4), and the third one is a standalone ssh daemon.

So what the man page says about the timings could have been true some 10 years ago, but not now.

> Start up inetd with -wWl -C 5
> In inetd.conf
> ssh stream tcp nowait root /usr/sbin/sshd /usr/sbin/sshd -i
> This will allow 5 connections per min from a single IP.

Yeah, I still use it to run (pro)ftpd, and never had problems with that. It's also possible to specify per-entry limits, like:

ftp stream tcp nowait/100/60/10 root /usr/libexec/ftpd ftpd -l
ssh stream tcp nowait/50/10/5 root /usr/sbin/sshd sshd -i

50/10/5 = max-children/max-conn-per-ip-per-minute/max-child-per-ip

Regards,
Atanas
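To make the meaning of the three numbers in the nowait field explicit, here is a small Python sketch that splits such a field into named limits, following the max-children/max-conn-per-ip-per-minute/max-child-per-ip reading given above. It is only an illustration of the notation; inetd does its own parsing, and the names WaitLimits and parse_wait_field are made up for this example.

```python
from typing import NamedTuple, Optional


class WaitLimits(NamedTuple):
    wait: bool
    max_children: Optional[int]
    max_conn_per_ip_per_minute: Optional[int]
    max_child_per_ip: Optional[int]


def parse_wait_field(field: str) -> WaitLimits:
    """Split an inetd.conf 'wait|nowait[/n[/n[/n]]]' field into named limits."""
    parts = field.split("/")
    if parts[0] not in ("wait", "nowait") or len(parts) > 4:
        raise ValueError(f"unexpected wait field: {field!r}")
    numbers = [int(p) for p in parts[1:]]
    # Limits that are not given stay None.
    padded = numbers + [None] * (3 - len(numbers))
    return WaitLimits(parts[0] == "wait", *padded)


print(parse_wait_field("nowait/50/10/5"))
print(parse_wait_field("nowait/100/60/10"))
```

For the ssh entry above this yields 50 children in total, 10 connections per IP per minute and 5 simultaneous children per IP.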
Re: SSH login takes very long time...sometimes
Dag-Erling Smørgrav said the following on 02/15/06 23:35:

> David Malone [EMAIL PROTECTED] writes:
>> I did once mail des@ to ask him if he'd mind me changing the default login timeout for sshd to be (say) 5 minutes rather than 1 minute, but I think he was busy at the time. Judging by the PR mentioned above it should be at least 2m30s by default. Des, would you mind this change being made?
>
> No objection, just let me see the patch first.
>
> DES

Just a thought, wouldn't this open a new possibility for denial of service attacks?

Last year I already had to decrease the LoginGraceTime from 120 to 30 seconds on my production boxes, but it didn't help much, so on top of that I had to implement (reinventing the wheel again) a script tailing the auth.log and firewalling bad guys in order to secure sshd and let my legitimate users in.

I really miss the inetd features. A setting like nowait/100/20/5 (/max-child[/max-connections-per-ip-per-minute[/max-child-per-ip]]) would effectively bounce the bad guys, but AFAIK (correct me if I'm wrong), ssh is no longer supposed to work via inetd and still has no such capabilities.

It'd be nice to have something like sendmail's client and rate connection limits, but I guess this is not the right place to ask.

Regards,
Atanas
Re: SSH login takes very long time...sometimes
David Malone said the following on 02/16/06 13:24:

> > Just a thought, wouldn't this open a new possibility for denial of service attacks?
>
> I doubt it. I'm guessing you're thinking of an attack where someone makes many connections to sshd in a short time and runs you out of processes? I think you can protect against this with the MaxStartups directive in sshd_config. The amount of time that an attacker has to open many connections is probably not that important, as you can open a lot of TCP connections in 1 second even with a small link.

These were different types of attacks, primarily originating from single IP addresses:

1. Dictionary attacks taking as many concurrent unauthenticated connections as possible, at speeds as fast as the server can respond. These were happening from a few up to several times a day, sometimes lasting hours.
2. Time based attacks, again taking all of the available MaxStartups, but then doing nothing until the LoginGraceTime expires, then again, etc. These were not so frequent, but had the worst impact on ssh availability.
3. Network scans on machines hosting some hundreds (or in some cases thousands) of IP addresses, causing outages lasting just a few minutes or so.

> > Last year I already had to decrease the LoginGraceTime from 120 to 30 seconds on my production boxes, but it didn't help much, so on top of that I got to implement (reinvent the wheel again) a script tailing the auth.log and firewalling bad guys in order to secure sshd and let my legitimate users in.
>
> Are you trying to prevent the ssh scanners that just try well-known combinations of usernames and passwords? It is not clear that you gain much by firewalling these off, other than having fewer log messages.

All of the above three. It wasn't just a matter of too many log messages. Type 1, for instance, besides the ssh unavailability, and depending on the MaxStartups setting, can bring a server to its knees by dedicating all of its available resources to bouncing unauthenticated ssh requests.

I tried setting a 'limit' ipfw firewall rule, something like:

    ipfw add allow tcp from any to any 22 in setup limit src-addr 5

I already had success with such a rule as a first level SMTP DoS protection before sendmail got its per-client and connection rate limits built in (since 8.13.0), and I still keep that on, just in case.

But unlike sendmail, ssh instances hammered with bogus requests are way more CPU intensive. I couldn't afford limiting ssh connectivity to just one single session per client IP (someone might need multiple ssh sessions while working on something, right?), and with multiple sessions enabled, machines would still be vulnerable to CPU overload (i.e. bouncing tons of useless ssh authentication attempts).

So the best option for me was to implement a log analyzer script placing temporary blocks on the firewall when necessary. For example, after 5 "Illegal user", "Failed password for" or "Did not receive identification string" events, the script simply denies that IP on the firewall right away for one hour.

So far this works well (for about 6 months already) and I no longer see unusual load spikes or ssh connectivity outages like before.

> > I really miss the inetd features. A setting like nowait/100/20/5 (/max-child[/max-connections-per-ip-per-minute[/max-child-per-ip]]) would effectively bounce the bad guys, but AFAIK (correct me if I'm wrong), ssh is no longer supposed to work via inetd and still has no such capabilities.
>
> You can still run sshd through inetd (or, at least, the -i option is still documented in the sshd man page). It does suggest that you may need to reduce the key size to make this practical (increasing LoginGraceTime here may help too ;-)

I knew that, but actually never tried it, thinking it would be too slow. Now I just ran ssh-keygen and found that it takes only a few seconds for a 1024-bit key and several for a 2048-bit key. So it's not that bad, and running it with a 512 or 1024-bit key through inetd seems feasible enough.

The default ssh key length in FreeBSD-6 however just got doubled from 1024 to 2048-bit. I believe there's a reason for that and don't like the idea of going down.

Regards,
Atanas
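The log analyzer described in the message above (block an address for one hour after 5 suspicious sshd events) could be sketched roughly as follows. This is not the author's script, only a minimal illustration under stated assumptions: the exact log wording matched by the patterns is a guess at typical sshd lines, the `Blocker` class and `offending_ip` helper are hypothetical names, and the actual firewall call is deliberately left out.

```python
import re
from collections import defaultdict

# Patterns for the three sshd events named in the message. The exact log
# wording is an assumption, not taken from the original script.
PATTERNS = [
    re.compile(r"Illegal user \S+ from (\d+\.\d+\.\d+\.\d+)"),
    re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)"),
    re.compile(r"Did not receive identification string from (\d+\.\d+\.\d+\.\d+)"),
]

THRESHOLD = 5          # events before an address is blocked
BLOCK_SECONDS = 3600   # one hour, as in the message


def offending_ip(line):
    """Return the source IP if the line is one of the watched events."""
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


class Blocker:
    """Hypothetical helper: counts events and tracks temporary blocks."""

    def __init__(self, threshold=THRESHOLD, block_seconds=BLOCK_SECONDS):
        self.threshold = threshold
        self.block_seconds = block_seconds
        self.counts = defaultdict(int)
        self.blocked_until = {}

    def feed(self, line, now):
        """Process one log line; return the IP if it must be blocked now."""
        ip = offending_ip(line)
        if ip is None or ip in self.blocked_until:
            return None
        self.counts[ip] += 1
        if self.counts[ip] >= self.threshold:
            self.blocked_until[ip] = now + self.block_seconds
            del self.counts[ip]
            return ip  # the caller would add a firewall deny rule here
        return None

    def expire(self, now):
        """Return IPs whose block has run out and forget them."""
        expired = [ip for ip, until in self.blocked_until.items() if until <= now]
        for ip in expired:
            del self.blocked_until[ip]
        return expired  # the caller would remove the firewall rules here
```

A caller would feed each new auth.log line to `feed()` with the current time and call `expire()` periodically; how the block is applied (ipfw, pf, or something else) is up to the surrounding script.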
Re: SSH login takes very long time...sometimes
Niki Denev said the following on 02/16/06 16:11:

> I solved this for me with the following pf(4) rule:
>
>     pass in quick on $ext inet proto tcp from any to any port ssh flags S/SA \
>         keep state (source-track rule, max-src-conn $max_conn_per_ip, max-src-conn-rate $max_conn_rate, \
>         overload <tempban-ssh> flush global)
>
> with appropriate $max_conn_per_ip and $max_conn_rate limits, and expiretable in a cronjob to flush all entries in the tempban-ssh table which are older than a predefined period. I hope this helps.

Thanks for the tip! I knew that at some point I would have to switch to pf, but unfortunately it wasn't available in FreeBSD-4.x, and I still have plenty of such boxes.

Does anybody know whether ipfw (or something else within FreeBSD-4) is capable of setting connection rate limits?

Regards,
Atanas
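To make the idea behind the rule above concrete, here is a small model of per-source connection rate limiting with a temporary ban table that is flushed by age. It is only a sketch of the concept, not of how pf implements it: the `ConnRateTracker` class is a hypothetical name, and treating the limit as "more than N connections within the window" is an assumption.

```python
from collections import defaultdict, deque


class ConnRateTracker:
    """Toy model: ban a source that opens too many connections too fast."""

    def __init__(self, max_conns, window_seconds):
        self.max_conns = max_conns
        self.window = window_seconds
        self.history = defaultdict(deque)  # ip -> recent connection times
        self.banned = {}                   # ip -> time the ban started

    def connection(self, ip, now):
        """Record a new connection; return False if ip is or becomes banned."""
        if ip in self.banned:
            return False
        times = self.history[ip]
        times.append(now)
        # Drop connection times that fell out of the sliding window.
        while times and times[0] <= now - self.window:
            times.popleft()
        if len(times) > self.max_conns:
            self.banned[ip] = now
            del self.history[ip]
            return False
        return True

    def expire(self, now, max_age):
        """Drop bans older than max_age seconds, as a periodic job would."""
        old = [ip for ip, started in self.banned.items() if now - started >= max_age]
        for ip in old:
            del self.banned[ip]
        return old
```

With `ConnRateTracker(3, 10)`, a fourth connection from the same address within ten seconds puts it on the ban list, while a client reconnecting every eleven seconds is never affected.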
Re: SSH login takes very long time...sometimes
[EMAIL PROTECTED] said the following on 02/16/06 14:49:

> Hello,
>
> You should try Xinetd as it has more options to help with this. I believe your SSH problem is due to a DNS/RDNS problem.

No, it wasn't a DNS issue.

(x)inetd would help, but in such a case sshd would need to generate a server key (which takes seconds and CPU) on every incoming ssh connection, which would be kind of slow and wasteful.

Regards,
Atanas
Re: 6.0-RELEASE freeze repetitively
Henri Hennebert said the following on 01/07/06 03:52:

> Hello,
>
> I have a production server which freezes from time to time (from 5 hours to 7 days between freezes).
>
> I tried (in /boot/loader.conf.local):
>
> - hint.acpi.0.disabled=1
> - debug.mpsafevfs=0 debug.mpsafevm=0
>
> to no avail.

If you have SMP and QUOTA enabled, you might find the following threads useful:

http://lists.freebsd.org/pipermail/freebsd-hackers/2005-November/014339.html
http://lists.freebsd.org/pipermail/freebsd-stable/2005-December/020606.html
http://lists.freebsd.org/pipermail/freebsd-stable/2006-January/021693.html

All of the possible workarounds I know of are limited to the following:

- disable SMP
- disable QUOTA support
- downgrade to 5.4(-STABLE)

This should have been in the errata since mid Nov 2005.

Regards,
Atanas

> I set up a serial console, went into ddb and got:
>
> KDB: enter: Line break on console
> [thread pid 61 tid 100052 ]
> Stopped at kdb_enter+0x30: leave
> db>
> db> bt
> Tracing pid 61 tid 100052 td 0xc355e300
> kdb_enter(c075ddb2,a,9e0a90,417ff9,e69e0aac) at kdb_enter+0x30
> siointr1(c3660800,a,2,2,6400) at siointr1+0xe7
> siointr(c3660800,257,0,4,c355e300) at siointr+0x78
> intr_execute_handlers(c352b090,e69e0aa0,e69e0af8,c06fee23,34) at intr_execute_handlers+0x98
> lapic_handle_intr(34) at lapic_handle_intr+0x3a
> Xapic_isr1() at Xapic_isr1+0x33
> --- interrupt, eip = 0xc057d1c0, esp = 0xe69e0ae4, ebp = 0xe69e0af8 ---
> vop_stdlock(c078e260,e69e0b50,e69e0b20,c0721424,e69e0b50) at vop_stdlock
> ffs_lock(e69e0b50,0,2012,c54db440,e69e0b6c) at ffs_lock+0x19
> VOP_LOCK_APV(c078db60,e69e0b50,c3811400,e69e0b4c,c057d232) at VOP_LOCK_APV+0x54
> vn_lock(c54db440,2012,c355e300,c079a980,c6029000) at vn_lock+0x13e
> vget(c54db440,2012,c355e300,c81c1aa0,1) at vget+0xff
> qsync(c3811400,c3811400,c355e300,c3b64660,0) at qsync+0x1df
> ffs_sync(c3811400,3,c355e300,c355e300,c355e300) at ffs_sync+0x3fb
> sync_fsync(e69e0ca0,e69e0cbc,c05876a4,c0782700,e69e0ca0) at sync_fsync+0x21b
> VOP_FSYNC_APV(c0782700,e69e0ca0,c355e300,0,e69e0cbc) at VOP_FSYNC_APV+0x3e
> sync_vnode(c39093f0,c355e300,68,c07493b5,0) at sync_vnode+0x1b4
> sched_sync(0,e69e0d38,4489e045,458d1424,244489dc) at sched_sync+0x2ef
> fork_exit(c05877a0,0,e69e0d38) at fork_exit+0x80
> fork_trampoline() at fork_trampoline+0x8
> --- trap 0x1, eip = 0, esp = 0xe69e0d6c, ebp = 0 ---
> db> ps
> pid proc uid ppid pgrp flag stat wmesg wchan cmd
> 94717 c3e64624 80 94700 94696 0004000 [RUNQ] rrdtool
> 94703 c3c086240 800 800 101 [RUNQ] rsync
> 94702 c3e6c000 80 94698 94698 0004000 [RUNQ] perl
> 94701 c3e6420c 113 94697 94697 0004001 [RUNQ] perl5.8.7
> 94700 c5f41a3c 80 94696 94696 0004000 [SLPQ wait 0xc5f41a3c][SLP] perl5.8.7
> 94698 c35fb418 80 94693 94698 0004000 [SLPQ wait 0xc35fb418][SLP] sh
> 94697 c3e6c20c 113 94694 94697 0004000 [SLPQ wait 0xc3e6c20c][SLP] sh
> 94696 c382a20c 80 94692 94696 0004000 [SLPQ wait 0xc382a20c][SLP] sh
> 94694 c35fb0000 710 710 000 [SLPQ piperd 0xc4adc660][SLP] cron
> 94693 c8160c480 710 710 000 [SLPQ piperd 0xc815b4c8][SLP] cron
> 94692 c5f406240 710 710 000 [SLPQ piperd 0xc38a9198][SLP] cron
> 94689 c815f0000 1 94689 100 [SLPQ select 0xc07a7404][SLP] sendmail
> 94584 c3e6ca3c 60667 678 678 100 [SLPQ lockf 0xc5c2ea80][SLP] perl5.8.7
> 94338 c3c08a3c 60667 678 678 100 [SLPQ lockf 0xc5c636c0][SLP] perl5.8.7
> 94300 c43ab8308 758 739 0004000 [SLPQ nanslp 0xc07a086c][SLP] sleep
> 93950 c389f624 60667 678 678 100 [SLPQ lockf 0xc5c2e380][SLP] perl5.8.7
> 93326 c43aaa3c 60667 678 678 100 [SLPQ select 0xc07a7404][SLP] perl5.8.7
> 93248 c3c088300 694 694 100 [SLPQ select 0xc07a7404][SLP] sendmail
> 92514 c43ab6248 754 754 0004000 [SLPQ sbwait 0xc4a43a64][SLP] perl5.8.7
> 92513 c3c084188 754 754 0004000 [SLPQ select 0xc07a7404][SLP] innfeed
> 88624 c816020c0 1 88624 0004002 [SLPQ ttyin 0xc3666010][SLP] getty
> 88573 c38274180 1 88573 0004002 [SLPQ ttyin 0xc3666410][SLP] getty
> 88570 c43a820c0 1 88570 0004002 [SLPQ ttyin 0xc3665410][SLP] getty
> 3820 c39ca6240 1 3820 0004002 [SLPQ ttyin 0xc3664c10][SLP] getty
> 905 c43ab20c0 1 905 000 [SLPQ select 0xc07a7404][SLP] inetd
> 896 c3e64c480 1 896 000 [SLPQ select 0xc07a7404][SLP] moused
> 872 c389c0000 168 000c082 (threaded) java
> thread 0xc60c9600 ksegrp 0xc389e1e0 [SLPQ kserel 0xc389e214][SLP]
> thread 0xc4f56180 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc5e2ae90][SLP]
> thread 0xc49eda80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc6191a64][SLP]
> thread 0xc4f61180 ksegrp 0xc389e1e0 [SLPQ kserel 0xc389e214][SLP]
> thread 0xc49fb900 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc3a37a64][SLP]
> thread 0xc4d13a80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc60d4e90][SLP]
> thread 0xc5f5c900 ksegrp 0xc389e1e0 [SLPQ accept 0xc3d9ab5a][SLP]
> thread 0xc4f22d80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc5eba20c][SLP]
> thread 0xc622ad80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc61914d4][SLP]
> thread 0xc3e65300
Re: diskio / filesystem related deadlock on SMP 6.0-STABLE machine.
Kris Kennaway said the following on 1/26/2006 3:46 PM:

> On Thu, Jan 26, 2006 at 06:44:22PM -0500, Kris Kennaway wrote:
> On Thu, Jan 26, 2006 at 06:37:16PM -0500, Mike Jakubik wrote:
> Kris Kennaway wrote:
> On Thu, Jan 26, 2006 at 05:07:56PM +0200, Niki Denev wrote:
> On Thursday 26 January 2006 10:40, Niki Denev wrote:
>
> [...] After I disabled option QUOTA in both my default kernel config and the one I compiled with the debugging options, I was unable to reproduce the deadlock again. (I hope it stays that way :) ) This, together with the report in my previous post, probably points to the problem being in the QUOTA support.
>
> Actually, I think this is known.
>
> Kris
>
> Well, that's good to know. I was planning on upgrading a production box from 5 to 6; it's SMP and uses QUOTA. How did 6 get released when QUOTA was known to cause deadlocks?
>
> FYI, you can probably work around this by setting debug.mpsafevfs=0. Of course, you'll lose the filesystem performance benefits.
>
> Kris

I'd like to confirm that setting debug.mpsafevfs=0 (along with debug.mpsafevm=0 and debug.mpsafenet=0) doesn't help either.

I ran into the same problem about a month ago on 2 production boxes running 6-STABLE (SMP with QUOTA enabled), and the only way I found to stop them crashing was switching back to 5.4. I hope this will get fixed in 6.1.

Regards,
Atanas
Re: 6.0-STABLE can't see floppy drive for KLD loading
Kim Culhan said the following on 12/29/05 14:10:

> Trying to install 6.0-STABLE on a Supermicro 5014C-MT server with a P8SCT motherboard. Also installed is a 3Ware 9550SX SATA RAID controller.
>
> The latest 6.0-STABLE snap from 12-08-05 does not have the version of the TWA driver which supports the 9550SX, so it would be necessary to kldload the binary, available from the 3Ware web site.
>
> Running sysinstall and navigating to Configure--Load KLD, the installation fails as it complains: Can't find the floppy
>
> The hardware appears to be good, as it is possible to boot a DOS floppy.
>
> Have not been successful finding a more recent snap.. the changes for the 9550SX controller were about a day late to be included in the 12-08 snap :(
>
> Any info is greatly appreciated

You could do the following:

- attach a standalone drive (IDE, SATA, SCSI)
- install FreeBSD and upgrade it to *-STABLE
- partition a 3ware drive (array, whatever) and make it bootable
- transfer the OS installation from the standalone drive to the 3ware partition.

Or you could create a bootable USB drive (hard, flash, iPod, digital camera, etc.) that is up to date with -STABLE (see /usr/src/tools/tools/nanobsd), boot the 3ware box from it, and install over the network.

I haven't tried the latter myself, but I have already used a nanobsd based USB flash drive as emergency recovery media on a 9550SX based box, and it worked.

Regards,
Atanas
Re: 6.0 random freezes
Peter Jeremy said the following on 12/13/05 02:00:

> Note that PS/2 keyboards aren't hot-pluggable and attempts to do so can have deleterious effects on your keyboard and/or motherboard. In any case, the probe/attach sequence relies on the kernel being in a reasonably sane state (and I'm not sure if it will detect the keyboard as a console device except at boot time).

I agree, but the keyboard is a passive device (with no power source, i.e. mostly harmless), and it's standard practice to have only a few movable consoles for several racks and plug them in only where necessary. It has always worked for us and I don't remember having any hot-plugging accidents in years.

> If the keyboard has been plugged in since the system booted, do you still get the same no response? If so, the kernel has wedged at a fairly low level and I'm not quite sure how to proceed other than by enabling the sanity checks that other people have mentioned (eg WITNESS, INVARIANTS) and hoping they catch something.

I cannot say for sure. When the thing happens I'm usually away, and by the time I get there, the console could have been used by someone. I'm in the process of getting a serial console, so if there's no response there either, I will enable the sanity checks.

> I only mentioned serial consoles on the off-chance that you had one. Whilst it may not help here, serial consoles have a number of advantages when managing remote equipment

Thanks for pointing this out. As I said, I'm in the process of getting one for now, and possibly equipping some dozens of servers with one later.

> > After the downgrade we could eventually set a test bed and start hammering it with requests. The problem would be how to trigger the crash and whether we would be able to reproduce it at all.

I already went the 5.4 downgrade way. Actually I was forced to do so the other night, when one of the machines started hanging every half an hour or so. Looks like the background fsck on the slower SATA based RAID5 array helped a lot with that.

Now I have the test bed online. This is the very same server (SCSI based, with the OS drive intact and the production data drives moved elsewhere) that was crashing once a day or so. Hopefully tomorrow I will have a serial console attached to it, so we can start pounding it. I hope this machine won't need to go into production during the next month or so and we'll have enough time for tests.

> Depending on your application and the interfaces to it, it might be feasible to either tee live traffic into both systems and just junk the responses from your test bed, or record live traffic and replay it into your test bed.

It runs a fairly complex set of services. It's a shared web hosting server handling some hundreds of websites, and also email (SMTP/POP3/IMAP), databases (MySQL), FTP, DNS, etc.

I don't know how easy it would be to implement such traffic gathering and replay on the test bed. It seems kind of complicated at first sight (though I realize it might be the only way to reproduce the crash). We might need some NAT (via ipfw?), some services might not like their responses being junked, etc.

I was thinking about trying the kernel stress suite first. Or just have something rsync-ing lots of files back and forth (possibly over the network), run apache bench in a loop and point it at some database intensive page, etc.

Regards,
Atanas
Re: 6.0 random freezes
Atanas said the following on 12/12/05 18:57:

> Peter Jeremy said the following on 12/12/05 13:40:
> > When it hangs, break into DDB (Ctrl-Alt-Esc on the console or BREAK on a serial console).
>
> But if I have no keyboard response I won't be able to save it, right?

(replying to myself)

This is exactly what I was afraid would happen. The SATA based box just hung up again, with all of the kernel debugging options in place:

    makeoptions DEBUG=-g
    options KDB
    options DDB

But I wasn't able to do anything with the keyboard in order to save a crashdump, so I had no choice but to hit the reset button.

Regards,
Atanas
6.0 random freezes
Hi,

I have 3 machines running 6.0-RELEASE, and recently 2 of them started freezing once a day or so. There are no error messages on the console or in the system logs.

The first one I put in production about a month ago, and it was working flawlessly until it got some load; now it freezes almost every day. The second one has exactly the same behavior: it was fine when doing nothing (for a couple of weeks), and started freezing when loaded.

The load I'm talking about is less than moderate (less than 2.0, with plenty of CPU idle time). The freezes also do not appear to happen at peak times (I have rrdtool based CPU load graphs).

Both machines have (almost) identical motherboards:

    Intel SE7520JR2SCSID2 and SE7520JR2ATAD2
    2 Intel XeonE 3.2GHz 800MHz CPUs
    4GB DDRII400 RegECC RAM

The first one has 8 72GB SEAGATE ST373207LC 0003 Ultra320 SCSI drives attached as plain drives (no RAID) to the on-board LSILogic 1030 Ultra4 Adapter.

The second one has 8 500GB SEAGATE ST3500641AS SATA2 drives attached to a 3ware Model 9550SX-8LP controller and configured as a RAID5 array.

The motherboards have 2 1000Mbps NICs on board, but due to some (em) driver problems, I usually disable these in the BIOS and use a PCI Intel 100Mbps (fxp) card instead.

Both machines were running 6.0-RELEASE, i386. For the second one I had to update the twa driver manually, as the one shipped with 6.0 didn't support the 3ware 9550SX. I see that the new version recently got committed to the -STABLE branches.
Here are the diffs against the GENERIC kernel configuration:

    cpu I486_CPU
    cpu I586_CPU
    makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
    options INET6 # IPv6 communications protocols
    53d47
    options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
    options QUOTA
    options SMP # Symmetric MultiProcessor Kernel

/boot/loader.conf:

    kern.ipc.nmbclusters=65536

/etc/sysctl.conf:

    kern.ipc.somaxconn=1024
    net.inet.tcp.recvspace=16384
    net.inet.ip.fw.verbose=1
    machdep.hyperthreading_allowed=1

Both machines boot with ACPI and hyperthreading enabled.

First I suspected the hardware, so I replaced the entire box (keeping the same drives). No change: it froze again in less than 24 hours.

Then I disabled ACPI (hint.acpi.0.disabled=1) and hyperthreading. No change, the same thing.

Then, after reading all the related (I believe) postings here and in freebsd-current, I decided to upgrade both boxes to 6.0-STABLE (I saw a lot of changes in the source tree), but the thing continued to happen.

I have another machine with the same hardware components (as the SCSI based one), but running 5.4-RELEASE. Unlike these two, it's really loaded (it even got DDoS-ed a while ago) and I have had zero problems with it for months.

I remember having similar issues when performing 4GB RAM upgrades on a bunch of 4.x based boxes, when I had to set KVA_PAGES to something like 512. For 5.3+ however this no longer seems to be an issue.

I would provide more useful feedback if I had some real and relevant error messages. Actually, I got some unusual errors on only one of the affected servers:

    Dec 11 02:48:36 xyz kernel: calcru: runtime went backwards from 28636364 usec to 28636021 usec for pid 28588 (httpd)

But it does not seem to be very relevant to the problem, as it did not happen anywhere close to the freezes (i.e. it was 26 hours after the last crash and 19 hours before the next one).
Now the only reasonable option for me (I mean for production and in the relatively short term) seems to be going down to 5.4 and waiting until 6.x gets more stable.

Two dmesg.boot files are attached. Any comments, suggestions and questions are welcome.

Regards,
Atanas

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.0-STABLE #0: Fri Dec 9 14:54:05 PST 2005
[EMAIL PROTECTED]:/var/obj/usr/src/sys/XYZ
ACPI APIC Table: A M I OEMAPIC
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3192.01-MHz 686-class CPU)
Origin = GenuineIntel Id = 0xf43 Stepping = 3
Features=0xbfebfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE
Features2=0x641dSSE3,RSVD2,MON,DS_CPL,CNTX-ID,CX16,b14
AMD Features=0x2010NX,LM
Hyperthreading: 2 logical CPUs
real memory = 3757965312 (3583 MB)
avail memory = 3678597120 (3508 MB)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 6
cpu3 (AP): APIC ID: 7
ioapic0: Changing APIC ID to 8
ioapic1: Changing APIC ID to 9
ioapic2: Changing APIC ID to 10
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 24-47 on motherboard
ioapic2
Re: 6.0 random freezes
Claus Guttesen said the following on 12/12/05 13:23:

> > Both machines boot with ACPI and hyperthreading enabled.
>
> Try to disable HTT in bios.

I think that I already achieved that by simply disabling the acpi module in device.hints, and it had no effect on the problem.

> It seldom gives you very much, and sometimes degrades performance. Is it a webserver?

It is a web server, and as such it tends to generate a lot of processes, many of them independent of each other and trying to run simultaneously. Thus more work horses (even less powerful virtual CPUs) make the server perform more smoothly. This is just a practical observation though, and I could be wrong. I would rather go with 2 dual core Opterons, but these are sort of expensive for now.

> If it generates a lot of temporary files you can try adding/changing the following in /etc/sysctl.conf:
>
>     kern.ipc.somaxconn=2048
>     kern.maxfiles=65536
>     vfs.ufs.dirhash_maxmem=8388608

Currently I have the following:

    kern.ipc.somaxconn: 1024
    kern.maxfiles: 12328
    vfs.ufs.dirhash_maxmem: 2097152
    kern.openfiles: 1992

Its closest relative (running 5.4-RELEASE on the same hardware) handles about twice as many requests, temporary files, and open files. kern.openfiles there is about 4000, and if something tries to go above the limits, the kernel usually reports that.

I have plenty of other boxes running 4.x and 5.x, serving at least twice as many requests with less powerful (also hyperthreaded) CPUs, and with no problems. The ones I have problems with are far less loaded, and are supposedly the faster ones.

Thanks for your suggestions!

Regards,
Atanas
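For a quick side-by-side of the values in the message above, here is a small sketch that parses the `name: value` lines and compares them with the suggested settings. The numbers come straight from the message; the function names `parse_sysctl` and `compare` are hypothetical, and nothing here talks to a real system.

```python
# Values reported in the message, in "name: value" form.
CURRENT_OUTPUT = """\
kern.ipc.somaxconn: 1024
kern.maxfiles: 12328
vfs.ufs.dirhash_maxmem: 2097152
kern.openfiles: 1992
"""

# Settings suggested in the quoted reply.
SUGGESTED = {
    "kern.ipc.somaxconn": 2048,
    "kern.maxfiles": 65536,
    "vfs.ufs.dirhash_maxmem": 8388608,
}


def parse_sysctl(text):
    """Turn 'name: value' lines into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            values[name.strip()] = int(value)
    return values


def compare(current, suggested):
    """Map each suggested name to (current, suggested, growth factor)."""
    report = {}
    for name, target in suggested.items():
        now = current.get(name)
        if now:
            report[name] = (now, target, target / now)
    return report


current = parse_sysctl(CURRENT_OUTPUT)
report = compare(current, SUGGESTED)
for name, (now, target, factor) in report.items():
    print(f"{name}: {now} -> {target} ({factor:.1f}x)")

# Share of the open-file limit actually in use on the affected box.
usage = current["kern.openfiles"] / current["kern.maxfiles"]
print(f"open files in use: {usage:.0%}")
```

The last line supports the point made in the reply: with 1992 of 12328 file slots in use, the box is nowhere near the current limit.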
Re: 6.0 random freezes
Ronald Klop said the following on 12/12/05 13:27:

> What happens if you set one of these sysctl values to 0? (This disables SMP changes from 5.4 to 6.0.)
>
>     debug.mpsafevfs: 1
>     debug.mpsafenet: 1
>     debug.mpsafevm: 1

Thanks for the suggestion! I just did so and rebooted both machines, so we'll see. I remember unsetting debug.mpsafenet before 5.4 due to some ipfw limitations, but didn't know about the other two.

> And is there a possibility (performance-wise) to build a kernel with WITNESS and/or INVARIANTS options compiled in. This will give more info about possible locking problems. Your system will run slower. And because of this the problem may not occur anymore, but it is worth the try.

Neither machine is heavily loaded, so I could afford slowing them down a bit for a while (I hope it won't be several times slower). I will do that at some point later if the problem persists.

I hope I won't be forced to downgrade to 5.4, though I'm already working on that (just in case).

Regards,
Atanas
Re: 6.0 random freezes
Peter Jeremy said the following on 12/12/05 13:40:

> Define freezing: Does it respond to pings? Can you switch VTYs? Do the num-lock/caps-lock LEDs respond? Do some processes seem to freeze before others?

I used the word freeze instead of crash, because the latter often gets associated with some errors reported by the kernel in system logs or on the console. In this case there are absolutely no error messages. I also have remote logging enabled (to another machine over the network), but there's nothing there either.

When the thing happens, the server appears to respond to pings for the first few minutes, but then everything stays down until I get to the data center. When I plug in a keyboard, there's no response at all: no LEDs, no VTYs, no Ctrl-Alt-Esc, etc.

You might think of hint.atkbd.0.flags not being set properly, but it's right (i.e. unchanged, it appears to default to that on i386 5.x+) and other machines with identical configuration do accept a keyboard.

I have no information about processes. The only thing I have is a real time CPU load graph. I have a script tailing the output of a vmstat cpu 15 and drawing a graph with user/system/idle times, and according to that graph there are no load spikes or unusual variations before the crashes. The usual user/system/idle percentages look like 10/7/83.

> I suggest you add the following to your kernel config:
>
>     options KDB # Enable kernel debugger support.
>     options DDB # Support DDB.

I just set these along with the DEBUG option below, and got the new kernel (from 6.0-RELEASE sources dated Dec 9) running on both machines, so we'll see.

> When it hangs, break into DDB (Ctrl-Alt-Esc on the console or BREAK on a serial console). As a start, run 'show lockedvnods' and 'ps'. My guess is that you'll see a lock that has a number of waiters - which is probably the culprit. Use 'panic' or 'call doadump' to get a crashdump and then you can use kgdb to rummage around once you reboot - see http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebg-gdb.html

I don't have any experience in chasing kernel bugs, so I'm not sure whether I would be able to get something useful, but I'll try that on the next crash. But if I have no keyboard response I won't be able to save it, right?

I do not know what a serial console is and would need some time to get familiar with it. Would I get something in addition to what I can get from the standard console?

> > makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
>
> I suggest you add this back in. Without it, you can't debug any crash dumps that you manage to get (and add dumpdev to your rc.conf).

My bad. I realized that it's kind of harmless, but that was weeks after I put the box in production. It's back there now.

The dumpdev variable seems to default to AUTO, i.e. trying to use the first swap device if it's bigger than the RAM (in my case it is), so I guess I don't need to touch it.

> Whilst I realise that you can't have production machines freezing on schedule, your assistance in providing more information about your problem will help make 6.x more stable.

Yes, I know and I will try. Today I already had a couple of crashes (I got lucky, no nasty data corruption this time), and I cannot afford for this to continue. I'm already working on the downgrade, but most likely I will have at least one of these 2 machines still running 6.x during the next day or two.

After the downgrade we could eventually set up a test bed and start hammering it with requests. The problem would be how to trigger the crash and whether we would be able to reproduce it at all.

Thanks for the prompt reply!

Regards,
Atanas
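The vmstat-tailing script mentioned in the message above is not shown in the thread. As a rough idea of its parsing half, the sketch below pulls user/system/idle percentages out of a vmstat data line and renders them as a text bar. It assumes that the last three columns of each data line are the `us sy id` CPU percentages; the sample line is made up to match the 10/7/83 figures from the message, and `parse_cpu` and `text_bar` are hypothetical names.

```python
def parse_cpu(line):
    """Return (user, system, idle) from a vmstat data line, or None.

    Assumes the last three whitespace-separated fields are the CPU
    percentages; header lines fail the int() conversion and are skipped.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        user, system, idle = (int(field) for field in fields[-3:])
    except ValueError:
        return None
    return user, system, idle


def text_bar(sample, width=50):
    """Render a sample as a fixed-width bar of 'U', 'S' and '.' characters."""
    user, system, idle = sample
    total = user + system + idle
    if total == 0:
        return "." * width
    u = width * user // total
    s = width * system // total
    return "U" * u + "S" * s + "." * (width - u - s)


# Made-up sample line, shaped like vmstat output; only the last three
# columns matter to this sketch.
sample_line = " 1 0 0 512340 204800 120 0 0 0 110 0 0 0 350 1200 800 10 7 83"
sample = parse_cpu(sample_line)
if sample is not None:
    print(text_bar(sample))
```

A real script would read lines from the running vmstat process and hand the samples to a graphing tool such as rrdtool, which the thread mentions but this sketch does not touch.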
Re: 6.0 random freezes
Atanas said the following on 12/12/05 15:43:

> Ronald Klop said the following on 12/12/05 13:27:
> > What happens if you set one of these sysctl values to 0? (This disables SMP changes from 5.4 to 6.0.)
> >
> >     debug.mpsafevfs: 1
> >     debug.mpsafenet: 1
> >     debug.mpsafevm: 1
>
> Thanks for the suggestion! I just did so and rebooted both machines, so we'll see.

(replying to myself)

... and coincidentally or not, I got the next crash in less than 10 minutes :-(

After the crash it ran for longer, until I rebooted it after rebuilding the kernel with debug hookups. Before the reboot I commented these out (i.e. set them back to 1), and now I'm waiting for a crashdump.

Regards,
Atanas