from:"Atanas"

Re: twa: Passthru request timed out! Resetting controller...

2006-11-15 Thread Atanas


Mark Dotson said the following on 11/14/06 1:18 PM:
I've had continued problems with the 3ware series SATA cards and the 
Tyan boards.  Specifically, I have a Tyan S5360-1U and both a 
9500S-4LP and an 8506 series 3ware card.


In my case the first error is different, but the 'resetting' over and 
over is VERY familiar.  This could be triggered by a simple file copy 
from one part of a container to another; degrading the unit and 
triggering the resetting crap.  Note that the drives are fine, I tested 
that first thing.


Sep  8 11:59:23 localhost kernel: 3w-9xxx: scsi0: WARNING: 
(0x06:0x002C): Unit #1: Command (0x2a) timed out, resetting card.

Sep  8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
Cache synchronized after power fail:unit=0.
Sep  8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
Cache synchronized after power fail:unit=1.

I also found this problem to exist across platforms, not just FreeBSD. 
For example, the excerpt above is from a CentOS box.


All tests were done with newest firmware for both card and mobo, and 
using the newest drivers provided by 3ware.


Once I removed the card and drives from the Tyan system and stuck them 
in pretty much ANY other system, they worked fantastically.


I don't have an answer for the resetting problem as of yet... 3ware 
and Tyan (And my system vendor Appro) are still trying to find my 
specific problem and solve it.  I believe they are currently doing the 
replace everything method of troubleshooting.



Mark, thank you.

It's good to know that the resetting problem exists on other platforms too.

We already found out that replacing the entire box with an identical one 
doesn't help, so unfortunately we'll have to start replacing components 
with different brands or models.


I wouldn't like to touch the I/O subsystem (these are already loaded 
production machines), so like you said, the safest bet would be to try 
another motherboard.


However, I don't see many dual-Opteron boards in 3ware's compatibility 
list. The next one that comes to mind from that list is the Supermicro 
H8DC8, but it looks more like a gamer's dream (high-end PCI-e graphics, 
SLI, etc., but no on-board VGA) than a server board.


I'm quite surprised that the top Opteron motherboard manufacturer 
listed in the motherboard compatibility docs on the 3ware web site:
http://3ware.com/products/pdf/Motherboard_compatibility_list_9550SX_2006_06.pdf 

makes 2 of the 5 boards marked as compatible, yet those boards perform so 
badly with 3ware cards.


I know what happens on this mailing list when somebody asks about 
good SATA cards (Re: 3ware, 3ware, ...); I have replied that way myself.


So are there any success stories with the 3ware 9550SX (SATA II) and dual 
AMD Opteron server boards, or is it time to go back to Intel?


Regards,
Atanas




twa: Passthru request timed out! Resetting controller...

2006-11-14 Thread Atanas


Has anyone experienced this:

twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20
twa0: INFO: (0x16: 0x1108): Resetting controller...:
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
...
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
twa0: INFO: (0x16: 0x1107): Controller reset done!:



This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on a 
number of machines with the following hardware configuration:


- Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
- 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives
  (configured as 8 SINGLE DISK units, aka JBOD)

All hardware components, including the server chassis, are listed in the 
3ware hardware compatibility lists. It doesn't seem to be a cabling or 
power issue. The controller and hard drives are already flashed to the 
latest firmware revisions. I tried turning off NCQ, but it didn't make 
any difference. I tried also switching the kernel from PAE to non-PAE 
(reducing the usable memory to 3GB), but it didn't help either.


I have other machines with similar I/O configurations (3ware), but 
with Intel motherboards and running FreeBSD 5.5, and these have been 
running fine for about a year. Now I'm thinking about swapping the drives 
between a working Intel box and an AMD-based box, to see whether the 
controller timeouts follow the drives.


The problem happens sporadically once in a month or so and is very hard 
to reproduce. Sometimes it takes several weeks until the next crash 
happens, sometimes it crashes again in just a few hours.
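
For anyone who wants to count these events across several machines, here is a minimal sketch in Python. It assumes the message layout shown in the excerpt above (device, severity, two hex codes, text). The names parse_twa_line and count_timeouts are made up for illustration and are not part of any 3ware tool.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Layout assumed from the excerpt: "twa0: ERROR: (0x05: 0x2018): text"
TWA_LINE = re.compile(
    r"(?P<unit>twa\d+): (?P<severity>[A-Z]+): "
    r"\((?P<source>0x[0-9A-Fa-f]+): (?P<code>0x[0-9A-Fa-f]+)\): "
    r"(?P<message>.*)"
)


class TwaEvent(NamedTuple):
    unit: str
    severity: str
    source: int
    code: int
    message: str


def parse_twa_line(line: str) -> Optional[TwaEvent]:
    """Return the parsed event, or None if the line is not a twa message."""
    m = TWA_LINE.search(line)
    if m is None:
        return None
    return TwaEvent(
        unit=m.group("unit"),
        severity=m.group("severity"),
        source=int(m.group("source"), 16),
        code=int(m.group("code"), 16),
        message=m.group("message").strip(),
    )


def count_timeouts(lines: Iterable[str]) -> int:
    """Count 'Passthru request timed out' errors (source 0x05, code 0x2018)."""
    events = (parse_twa_line(line) for line in lines)
    return sum(1 for ev in events if ev and (ev.source, ev.code) == (0x05, 0x2018))


# Sample taken from the excerpt above (the wrapped first line is joined).
SAMPLE = [
    "twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20",
    "twa0: INFO: (0x16: 0x1108): Resetting controller...:",
    "twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0",
    "twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1",
    "twa0: INFO: (0x16: 0x1107): Controller reset done!:",
]
```

Feeding it the lines of /var/log/messages from each box would give a per-machine timeout count to compare.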


When this happens, the kernel sometimes panics (most likely due to 
the inconsistent filesystem state caused by the controller reset) and 
sometimes just hangs. It can be interrupted (I have a serial console), 
but the only useful thing to do after that seems to be to call cpu_reset(), 
followed by a full (and sometimes painfully long) filesystem check.


Here are the diffs against the default GENERIC and PAE kernel 
configurations:


 cpu   I486_CPU
 ident GENERIC
 options   INET6   # IPv6 communications protocols
 options   SCSI_DELAY=5000 # Delay (in ms) before probing SCSI

 options   QUOTA
 options   SMP # Symmetric MultiProcessor Kernel
 options   BREAK_TO_DEBUGGER
 options   DDB
 options   KDB
 options   KDB_UNATTENDED

 options   IPFIREWALL
 options   DUMMYNET

I'm attaching the dmesg.boot following the latest crash.

Regards,
Atanas

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-PRERELEASE #0: Mon Nov 13 17:47:40 PST 2006
[EMAIL PROTECTED]:/var/obj/usr/src/sys/XYZ-PAE
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Dual Core AMD Opteron(tm) Processor 270 (2009.27-MHz 686-class CPU)
  Origin = AuthenticAMD  Id = 0x20f12  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x1<SSE3>
  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow+,3DNow>
  AMD Features2=0x3<LAHF,CMP>
  Cores per package: 2
real memory  = 5368709120 (5120 MB)
avail memory = 4182241280 (3988 MB)
ACPI APIC Table: <PTLTD  APIC  >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
ioapic0 <Version 1.1> irqs 0-23 on motherboard
ioapic1 <Version 1.1> irqs 24-27 on motherboard
ioapic2 <Version 1.1> irqs 28-31 on motherboard
kbd1 at kbdmux0
acpi0: <PTLTD   RSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory> at device 0.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
pci0: <serial bus, USB> at device 2.0 (no driver attached)
atapci0: <nVidia nForce CK804 UDMA133 controller> port 
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x1400-0x140f at device 6.0 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
pcib1: <ACPI PCI-PCI bridge> at device 9.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <display, VGA> at device 6.0 (no driver attached)
fxp0: <Intel 82551 Pro/100 Ethernet> port 0x2400-0x243f mem 
0xda101000-0xda101fff,0xda12-0xda13 irq 16 at device 8.0 on pci1
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00

Re: twa: Passthru request timed out! Resetting controller...

2006-11-14 Thread Atanas


adam radford said the following on 11/14/06 11:56 AM:


Are you running the latest 3ware firmware on that controller?


Yep. It's in dmesg.boot:

twa0: INFO: (0x15: 0x1300): Controller details:: Model 9550SX-8LP, 8 
ports, Firmware FE9X 3.04.01.011, BIOS BE9X 3.04.00.002


That's the latest one, released as 9.3.0.7 on the 3ware website. 
Yesterday I flashed and rebooted them all, and this morning I got the 
next crash.


Regards,
Atanas




Re: QUOTA+SNAPSHOT

2006-08-14 Thread Atanas


Alexey Karagodov said the following on 8/14/06 10:17 AM:

hi everybody!
is QUOTA+SNAPSHOT problem solved? any work-around?

also I tried dd if=/dev/zero of=/var/swap/swap.bin bs=1m
count=32768, and the server stopped serving any requests except ping and
scroll-lock on the console after 18983936K of data had been written.
my dmesg.boot:

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
   The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p3 #1: Tue Jul 18 15:02:07 MSD 2006

Not in 6.1-RELEASE AFAIK. I believe there are many fixes in 6-STABLE; 
however, I haven't had a chance to test snapshots so far.


I can say that for me 6-STABLE is no less stable than 
6.1-RELEASE-p3. I had 3 production machines (dual dual-core Opterons) 
running 6.1-R/i386+PAE and crashing once a week or so with QUOTAS 
enabled and SNAPSHOTS disabled (i.e. background_fsck=NO in rc.conf). 
Since upgrading to 6-STABLE I have had 10 days of uptime and no crashes so far.


Regards,
Atanas

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: portupgrade bug: -M no longer works after v2.1.0

2006-07-13 Thread Atanas


Sergey Matveychuk said the following on 7/12/06 10:57 PM:

Atanas wrote:

Please, don't get me wrong. I'm not asking for help or for a workaround.
I'm actually trying to help identify a problem or regression.

If this is not a bug, but a feature change, please have it documented.


It was a bug. Fixed. Thanks.


Thank you!
Atanas


Re: portupgrade bug: -M no longer works after v2.1.0

2006-07-12 Thread Atanas


Sergey Matveychuk said the following on 7/11/2006 10:08 PM:

Atanas wrote:

Recent portupgrade versions no longer obey the -M command line switch,
i.e. any optional arguments to be prepended to each make command.

How to reproduce:

# portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13


Everything works fine. Use -m to get what you want.

For www/apache13 the -m switch could give the same result as -M would, 
but I'm not sure whether it's not just a coincidence. The -m switch was 
supposed to serve a different purpose:


  -m
  --make-args   Specify arguments to append to each make(1) command
                line.
  -M
  --make-env    Specify arguments to prepend to each make(1) command
                line.

I tried testing another port where I used both:

# portinstall -M 'WITH_SYSLOG_FACILITY=local5' -m '-DWITHOUT_IPV6' 
mail/courier-imap


With portupgrade-2.0.1_1,1 (the stock 6.1-RELEASE package) it worked. 
With portupgrade-2.1.3.2,2 it failed (ignoring the -M part like for 
www/apache13 before).


Then I joined both in one -m switch:

# portinstall -m 'WITH_SYSLOG_FACILITY=local5 -DWITHOUT_IPV6' 
mail/courier-imap


and the latest portupgrade-2.1.3.2,2 did it just fine.

So, as you suggested, the -m switch seems to cover the functionality 
that -M used to provide. However, I'm not sure whether switching from 
prepending to appending would work for all ports. For the ones I use it 
appears to work, so I have no problem and will update my scripts to use 
-m only.


The no longer working (obsolete?) -M switch would need to be removed 
from the man page though.


Thanks,
Atanas

Re: portupgrade bug: -M no longer works after v2.1.0

2006-07-12 Thread Atanas


Sergey Matveychuk said the following on 7/12/06 3:24 AM:


Both -m and -M work fine but do different things. -m passes its argument
as make argument(s) and -M passes its argument as environment
variable(s). You can't set a make variable with an environment variable.
They are different!

Yes, I know that. That's why I quoted the man page in my previous post 
(see above).



-M has never worked as you expected.

I expected -M to work exactly as documented, as it has done for years.



You can test it with a command:
%cd /usr/ports/www/apache13
%env APACHE_HARD_SERVER_LIMIT=1024 make


Sure, the port build works:

% cd /usr/ports/www/apache13
% env APACHE_HARD_SERVER_LIMIT=1024 make
===> src/os/unix
cc -c  -I../../os/unix -I../../include 
-I/usr/local/include  -funsigned-char -O2 -fno-strict-aliasing -pipe 
-DDOCUMENT_LOCATION=\"/usr/local/www/data\" 
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=1024 
`../../apaci` os.c


while portupgrade -M fails:

% portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
===> src/os/unix
cc -c  -I../../os/unix -I../../include 
-I/usr/local/include  -funsigned-char -O2 -fno-strict-aliasing -pipe 
-DDOCUMENT_LOCATION=\"/usr/local/www/data\" 
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 
`../../apaci` os.c



as opposed to
% make APACHE_HARD_SERVER_LIMIT=1024


This doesn't make much sense to me. Perhaps you meant:

% make -DAPACHE_HARD_SERVER_LIMIT=1024

But I wouldn't expect this to succeed either:

===> src/os/unix
cc -c  -I../../os/unix -I../../include 
-I/usr/local/include  -funsigned-char -O2 -fno-strict-aliasing -pipe 
-DDOCUMENT_LOCATION=\"/usr/local/www/data\" 
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 
`../../apaci` os.c


Your suggestion however (to replace -M with -m) surprisingly worked:

cc -c  -I../../os/unix -I../../include -I/usr/local/include 
-funsigned-char -O2 -fno-strict-aliasing -pipe 
-DDOCUMENT_LOCATION=\"/usr/local/www/data\" 
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=1024 
`../../apaci` os.c


And this is what's confusing.


I think you confuse the two variables types.


No, I think I know what these are for.

But trying your suggestion to replace -M with -m and finding it to work 
(for some ports?), just threw some more fog into the case.


Let me state it clearly again - I have found that all recent versions of 
portupgrade (2.1.0+) fail to obey the -M switch and ignore any optional 
port parameters (i.e. arguments to prepend to each make command line) 
supplied there.
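
As a rough illustration of what the man page describes (not portupgrade's actual code, which is written in Ruby), here is a sketch in Python of how the two switches are documented to shape the make(1) command line. build_make_command is a hypothetical helper, and using env(1) to carry the prepended assignments is my assumption about how "prepend" would be realized.

```python
import shlex
from typing import List


def build_make_command(make_env: str = "", make_args: str = "",
                       make: str = "make") -> List[str]:
    """Compose an argv list the way the man page words it:
    the -M value goes in front of make, the -m value goes after it."""
    env_parts = shlex.split(make_env)    # e.g. "VAR=value" assignments (-M)
    arg_parts = shlex.split(make_args)   # e.g. "-DWITHOUT_IPV6" (-m)
    argv: List[str] = []
    if env_parts:
        # env(1) places the assignments in make's environment.
        argv += ["env"] + env_parts
    argv.append(make)
    argv += arg_parts
    return argv


# The two invocations discussed in this thread:
apache = build_make_command(make_env="APACHE_HARD_SERVER_LIMIT=1024")
courier = build_make_command(make_env="WITH_SYSLOG_FACILITY=local5",
                             make_args="-DWITHOUT_IPV6")
```

With 2.0.1_1 the first case behaves as if "env APACHE_HARD_SERVER_LIMIT=1024 make" had been run. With 2.1.0+ the env part appears to be dropped, which is the regression I am reporting.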


Please, don't get me wrong. I'm not asking for help or for a workaround. 
I'm actually trying to help identify a problem or regression.


If this is not a bug, but a feature change, please have it documented. 
What the portupgrade(1) man page says about the -M switch is incorrect, 
as it no longer prepends any arguments specified to each make(1) command 
line.


Regards,
Atanas


portupgrade bug: -M no longer works after v2.1.0

2006-07-11 Thread Atanas

Recent portupgrade versions no longer obey the -M command line switch, 
i.e. any optional arguments to be prepended to each make command.


How to reproduce:

# portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
...
===> src/ap
cc -c  -I../os/unix -I../include -I/usr/local/include  -funsigned-char 
-O2 -fno-strict-aliasing -pipe 
-DDOCUMENT_LOCATION=\"/usr/local/www/data\" 
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512 
`../apaci` ap_cpystrn.c

...

Note the -DHARD_SERVER_LIMIT=512 above.
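
When testing many portupgrade versions, checking the build output by eye gets tedious. A small sketch in Python of how the check could be automated, assuming the compiler lines look like the one above; defined_value is a hypothetical helper name, not something from the ports tree.

```python
import re
from typing import Optional


def defined_value(cc_line: str, macro: str) -> Optional[str]:
    """Return the value given to -D<macro>=... on a compiler command line,
    or None if the macro is not defined there with a value."""
    pattern = r"(?:^|\s)-D" + re.escape(macro) + r"=(\S+)"
    m = re.search(pattern, cc_line)
    return m.group(1) if m else None


# Compiler line as shown above, joined onto one line.
CC_LINE = (
    "cc -c  -I../os/unix -I../include -I/usr/local/include  -funsigned-char "
    "-O2 -fno-strict-aliasing -pipe "
    '-DDOCUMENT_LOCATION=\\"/usr/local/www/data\\" '
    '-DDEFAULT_PATH=\\"/bin:/usr/bin:/usr/local/bin\\" -DHARD_SERVER_LIMIT=512 '
    "`../apaci` ap_cpystrn.c"
)

limit = defined_value(CC_LINE, "HARD_SERVER_LIMIT")
```

A result of "512" here means the value passed with -M did not reach the build. "1024" means it did.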

The stock version shipped with 6.1-RELEASE (2.0.1_1) seems to work as 
expected.


I tried all CVS versions after that (with portdowngrade) and found that 
the breakage has happened somewhere between 2.1.0 and 2.1.1 (2006/06/02 
- 2006/06/04).


Regards,
Atanas




Re: portupgrade bug: -M no longer works after v2.1.0

2006-07-11 Thread Atanas


Matthias Andree said the following on 7/11/06 1:48 PM:

Atanas [EMAIL PROTECTED] writes:


Recent portupgrade versions no longer obey the -M command line switch,
i.e. any optional arguments to be prepended to each make command.

How to reproduce:

# portinstall -M APACHE_HARD_SERVER_LIMIT=1024 www/apache13
...
===> src/ap
cc -c  -I../os/unix -I../include -I/usr/local/include  -funsigned-char
-O2 -fno-strict-aliasing -pipe
-DDOCUMENT_LOCATION=\"/usr/local/www/data\"
-DDEFAULT_PATH=\"/bin:/usr/bin:/usr/local/bin\" -DHARD_SERVER_LIMIT=512
`../apaci` ap_cpystrn.c
...

Note the -DHARD_SERVER_LIMIT=512 above.


Does it work if you type (you can omit the env in /bin/sh, bash, (pd)ksh
and other Bourne-like shells):

env APACHE_HARD_SERVER_LIMIT=1024 portinstall www/apache13

Of course it would, but this just bypasses the problem. There are other 
ways to work around this as well - like not using portupgrade at all and 
building everything with make.


The problem is that a bug introduced in one of the recent 
portupgrade versions changes its documented behavior. The '-M' 
switch in particular no longer works, causing any existing 
port/package installation scripts that depend on that switch to build 
packages with incorrect optional parameters.


It's not a problem with a particular port. The www/apache13 port was 
given just as an example of how to reproduce the bug.


This affects _all_ ports when installed/upgraded/built via portupgrade 
and when the '-M' switch is used.



(Isn't it time to migrate to a newer Apache version anyways? 8-) )

(This is a long subject and kind of off-topic here. My short answer is 
no, or not yet. In some environments there are still legitimate reasons 
to use 1.3)


Regards,
Atanas

Re: em device hangs on ifconfig alias ...

2006-07-10 Thread Atanas


Pyun YongHyeon said the following on 7/7/06 8:32 PM:

On Fri, Jul 07, 2006 at 10:38:01PM +0100, Robert Watson wrote:

  Yes -- basically, there are two problems:

  (1) A little problem, in which an arp announcement is sent before the link 
  has settled after reset.

  (2) A big problem, in which the interface is gratuitously reset, requiring 
  long settling times.

  I'd really like to see a fix to the second of these problems (not resetting 
  when an IP is added or removed, resulting in link renegotiation); the first 
  one I'm less concerned about, although it would make some amount of sense 
  to do an arp announcement when the link goes up.
  


Ah, I see. Thanks for the insight.
How about the attached patch?

This patch seems to fix both of the issues, or at least this is what I 
see now:

- the card no longer gets reset when adding an alias;
- the arp packet gets delivered;
- adding 250 aliases takes less than a second;
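
For anyone who wants to repeat the 250-alias test, the commands can be generated with a short script. This is only a sketch in Python: the interface name em1 matches this thread, but the starting address 10.10.64.1 is an assumption for illustration, and alias_commands is a hypothetical helper that only builds the command strings without running them.

```python
import ipaddress
from typing import List


def alias_commands(iface: str, first_ip: str, count: int) -> List[str]:
    """Build 'ifconfig <iface> inet alias <ip>' lines for consecutive addresses."""
    start = ipaddress.IPv4Address(first_ip)
    return [f"ifconfig {iface} inet alias {start + i}" for i in range(count)]


# 250 consecutive aliases on em1; the address range is assumed, adjust to taste.
commands = alias_commands("em1", "10.10.64.1", 250)
```

Writing the list to a file and running it under time(1) as root gives a figure comparable to the "less than a second" above.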

I haven't fully tested whether all 250 IP aliases were accessible (I 
used non-routable IP addresses), but I suppose so. Also I couldn't 
stress the patched driver enough to see whether it performs as expected.


Overall it looks good, though some more testing might be needed before 
the patch is merged into the source tree.


Regards,
Atanas






Index: if_em.c
===================================================================
RCS file: /pool/ncvs/src/sys/dev/em/if_em.c,v
retrieving revision 1.116
diff -u -r1.116 if_em.c
--- if_em.c	6 Jun 2006 08:03:49 -0000	1.116
+++ if_em.c	8 Jul 2006 03:30:36 -0000
@@ -67,6 +67,7 @@
 
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
+#include <netinet/if_ether.h>
 #include <netinet/ip.h>
 #include <netinet/tcp.h>
 #include <netinet/udp.h>
@@ -692,6 +693,9 @@
 
 	EM_LOCK_ASSERT(sc);
 
+	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+	    IFF_DRV_RUNNING)
+		return;
 	if (!sc->link_active)
 		return;
 
@@ -745,6 +749,7 @@
 {
 	struct em_softc *sc = ifp->if_softc;
 	struct ifreq *ifr = (struct ifreq *)data;
+	struct ifaddr *ifa = (struct ifaddr *)data;
 	int error = 0;
 
 	if (sc->in_detach)
@@ -752,9 +757,22 @@
 
 	switch (command) {
 	case SIOCSIFADDR:
-	case SIOCGIFADDR:
-		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)");
-		ether_ioctl(ifp, command, data);
+		if (ifa->ifa_addr->sa_family == AF_INET) {
+			/*
+			 * XXX
+			 * Since resetting hardware takes a very long time
+			 * we only initialize the hardware only when it is
+			 * absolutely required.
+			 */
+			ifp->if_flags |= IFF_UP;
+			if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+				EM_LOCK(sc);
+				em_init_locked(sc);
+				EM_UNLOCK(sc);
+			}
+			arp_ifinit(ifp, ifa);
+		} else
+			error = ether_ioctl(ifp, command, data);
 		break;
 	case SIOCSIFMTU:
 	    {
@@ -802,17 +820,19 @@
 		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)");
 		EM_LOCK(sc);
 		if (ifp->if_flags & IFF_UP) {
-			if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+			if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+				if ((ifp->if_flags ^ sc->if_flags) &
+				    IFF_PROMISC) {
+					em_disable_promisc(sc);
+					em_set_promisc(sc);
+				}
+			} else
 				em_init_locked(sc);
-			}
-
-			em_disable_promisc(sc);
-			em_set_promisc(sc);
 		} else {
-			if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
+			if (ifp->if_drv_flags & IFF_DRV_RUNNING)
 				em_stop(sc);
-			}
 		}
+		sc->if_flags = ifp->if_flags;
 		EM_UNLOCK(sc);
 		break;
 	case SIOCADDMULTI:
@@ -878,8 +898,8 @@
 		break;
 	    }
 	default:
-		IOCTL_DEBUGOUT1("ioctl received: UNKNOWN (0x%x)", (int)command);
-		error = EINVAL;
+		error = ether_ioctl(ifp, command, data);
+		break;
 	}
 
 	return (error);
Index: if_em.h
===================================================================
RCS file: /pool/ncvs/src/sys/dev/em/if_em.h,v
retrieving revision 1.44
diff -u -r1.44 if_em.h
--- if_em.h 15 Feb 2006 08:39:50

Re: em device hangs on ifconfig alias ...

2006-07-07 Thread Atanas


Robert Watson said the following on 7/7/06 7:17 AM:
 I just left a tcpdump -n arp host 10.10.64.40 on a third machine  
sniffing around and tested all em module versions I had (the stock 
6.1,  6-STABLE and 6-STABLE with your patch), but got silence on all 
three:


That's odd. I've tested it on CURRENT and I could see the ARP packet. 
Are you sure you patched correctly? If so I have to build a RELENG_6 
machine and give it try.


Is it possible you're seeing an interaction between the reset generated 
as part of IP address changing, and the time it takes to negotiate 
link?  It's possible that the arp packets are being eaten during the 
link negotiation, so for systems negotiating quickly (or not at all) 
then the arp packet is seen on other hosts, and otherwise not...



Looks like this is exactly what happens.

I was able to see it by running two tcpdump instances - one on the EM 
machine running in background and another running elsewhere on the same 
subnet.


So on the EM machine the arp packet actually gets generated by em(4) and 
caught by the tcpdump running there:


EM# tcpdump -n arp and ether src 00:04:23:b5:1b:ff 
EM#
EM# ifconfig em1 inet alias 10.10.64.40
EM# 11:28:37.178946 arp who-has 10.10.64.40 tell 10.10.64.40
EM#

But it doesn't reach the other tcpdump instance running on another host. 
It seems that the arp packet gets killed before leaving the EM machine, 
due to the card initialization or something else.
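
To compare the two captures without reading them by eye, the announcement can be picked out of the tcpdump output mechanically. A minimal sketch in Python, assuming the "who-has X tell Y" wording shown above; is_gratuitous_arp is a hypothetical helper name. A gratuitous ARP is one where the address asked for and the address asking are the same.

```python
import re

# Matches "who-has <target> tell <sender>" as printed by tcpdump;
# a trailing comma after the sender, as some tcpdump versions print, is ignored.
WHO_HAS = re.compile(r"who-has\s+(\S+)\s+tell\s+([^\s,]+)")


def is_gratuitous_arp(line: str) -> bool:
    """True if the line is an ARP request whose target equals its sender."""
    m = WHO_HAS.search(line)
    return bool(m) and m.group(1) == m.group(2)


# The line captured on the EM machine above:
seen_on_em = is_gratuitous_arp(
    "11:28:37.178946 arp who-has 10.10.64.40 tell 10.10.64.40"
)
```

Running both captures through the same check shows directly whether the announcement left the EM machine: it should match on both hosts, and in my test it only matches locally.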


I tried sending it manually with arping, just to make sure both tcpdumps 
operate properly and yes, the packet got delivered to both.


I think that I have patched, built and loaded the em(4) kernel module 
correctly. There were no rejects when applying the patch, and before 
building the module I intentionally appended "(patched)" to its version 
string in if_em.c, so I could see that in dmesg every time I loaded the 
module:

em1: Intel(R) PRO/1000 Network Connection Version - 3.2.18 (patched)

Regards,
Atanas

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: em device hangs on ifconfig alias ...

2006-07-06 Thread Atanas


Pyun YongHyeon said the following on 7/5/06 7:14 PM:


Here is patch generated against RELENG_6.


OK, I just tested that, but it doesn't seem to make any difference.

Here's what I did:

I commented out the em device from my kernel (a 6-STABLE one from 
yesterday) and compiled three if_em kernel modules:

- one taken from 6.1 release
- the unpatched 6-STABLE one
- the latter with the above patch applied

So I was able to load and test each of these modules independently, 
without actually restarting the machine. I also changed the driver 
version string in if_em.c, just to make sure that I was really loading 
the right em module by checking dmesg:


em1: Intel(R) PRO/1000 Network Connection Version - 3.2.18 (patched) 
port 0xdc80-0xdcbf mem 0xfcfe-0xfcff irq 55 at device 4.1 on pci3

em1: Ethernet address: 00:04:23:b5:1b:ff
em1: link state changed to UP

I used 2 machines - one running 6.1-RELEASE and using fxp (I'll call it 
FXP), and the test one running 6-STABLE with em (I'll call it EM), 
and tried exchanging/moving an IP alias between them.


FXP# ifconfig
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        inet 10.10.64.30 netmask 0xff00 broadcast 10.10.64.255
        ether 00:e0:81:31:f4:1e
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active

EM# ifconfig
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        inet 10.10.64.63 netmask 0xff00 broadcast 10.10.64.255
        ether 00:04:23:b5:1b:ff
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active

First I brought up an IP alias on the FXP machine:

FXP# ifconfig fxp0 inet alias 10.10.64.40 netmask 255.255.255.255

and checked whether it's accessible from anywhere - yes. Then I moved 
that to EM:


FXP# ifconfig fxp0 inet -alias 10.10.64.40
EM# ifconfig em1 inet alias 10.10.64.40 netmask 255.255.255.255

and checked again - no. It was accessible only from its own subnet 
(10.10.64.x), but not from anywhere else.


Moving that back to FXP works, but moving it back to EM doesn't. The 
only way I found to make it accessible was to arping something from the 
aliased IP address:


EM# arping -S10.10.64.40 -c1 somehost

So it seems that when an IP alias has been recently used on some other 
machine (on FXP in my case), the em driver is unable to initialize that 
IP alias properly.


It might be that the fxp driver is not sending something when releasing 
an alias, who knows. But fact is that fxp always initializes its aliases 
properly - I use it extensively and it always worked.


I tried setting another IP alias that had never been used on these 
machines. I brought that up first on EM and it worked. Then I moved it to 
FXP and it also worked! But moving it back to EM made it inaccessible.


It looks like there's something fishy with the alias initialization.

Another related problem is that the card gets re-initialized (reset?) on 
each alias you add (it takes between 0.3 and 1 second, depending on how 
fast the hardware is), which for mass-aliased systems could be a serious 
hurdle after a crash or reboot.


Regards,
Atanas






--- if_em.c.orig	Fri May 19 09:19:57 2006
+++ if_em.c	Thu Jul  6 11:10:56 2006
@@ -657,8 +657,9 @@
 
 	mtx_assert(&adapter->mtx, MA_OWNED);
 
-	if (!adapter->link_active)
-		return;
+	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+	    IFF_DRV_RUNNING)
+		return;
 
 	while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
 
@@ -719,11 +720,6 @@
 	if (adapter->in_detach) return(error);
 
 	switch (command) {
-	case SIOCSIFADDR:
-	case SIOCGIFADDR:
-		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)");
-		ether_ioctl(ifp, command, data);
-		break;
 	case SIOCSIFMTU:
 	    {
 		int max_frame_size;
@@ -760,16 +756,17 @@
 		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)");
 		EM_LOCK(adapter);
 		if (ifp->if_flags & IFF_UP) {
-			if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+			if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+				if ((ifp->if_flags ^ adapter->if_flags) &
+				    IFF_PROMISC) {
+					em_disable_promisc(adapter);
+					em_set_promisc(adapter);
+				}
+			} else
 				em_init_locked(adapter);
-			}
-
-			em_disable_promisc(adapter);
-			em_set_promisc(adapter);
 		} else {
-			if (ifp->if_drv_flags & IFF_DRV_RUNNING

Re: em device hangs on ifconfig alias ...

2006-07-06 Thread Atanas


Pyun YongHyeon said the following on 7/6/06 6:03 PM:


Hmm, that's strange. I've double checked that stock em(4) didn't
generate ARP packets when its addresses were changed. So I made
em(4) generate ARP. Could you see a gratuitous ARP with tcpdump
when you change its address?

I just left a tcpdump -n arp host 10.10.64.40 on a third machine 
sniffing around and tested all em module versions I had (the stock 6.1, 
6-STABLE and 6-STABLE with your patch), but got silence on all three:


EM# ifconfig em1 inet alias 10.10.64.40
nothing
EM# ifconfig em1 inet -alias 10.10.64.40
nothing

The fxp driver appears to send something on startup and nothing on 
shutdown:


FXP# ifconfig fxp0 inet alias 10.10.64.40
18:41:54.584059 arp who-has 10.10.64.40 tell 10.10.64.40
FXP# ifconfig fxp0 inet -alias 10.10.64.40
nothing

When I manually arping the em alias after startup (i.e. simulate what 
fxp does), everything works as expected:


EM# ifconfig em1 inet alias 10.10.64.40
nothing
EM# arping -c1 -S10.10.64.40 10.10.64.40
18:46:07.808701 arp who-has 10.10.64.40 tell 10.10.64.40
EM# ifconfig em1 inet -alias 10.10.64.40
nothing

It appears that this is what the em driver is supposed to do, or at 
least fxp does it in this way.
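
For reference, the packet in both traces ("arp who-has 10.10.64.40 tell 
10.10.64.40") is a gratuitous ARP request, i.e. one whose sender and 
target IP address are both the alias. Below is a minimal Python sketch of 
how such a frame is laid out, assuming plain Ethernet/IPv4 ARP. The 
function name is made up for the illustration, and the code only builds 
the bytes; it does not send anything and is not what em(4) or arping 
actually run:

```python
import socket
import struct

def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC and IP, then an all-zero target MAC and target IP == sender IP
    arp += hw + addr + b"\x00" * 6 + addr
    return eth + arp

# MAC and alias taken from the em1 output earlier in this thread
frame = gratuitous_arp("00:04:23:b5:1b:ff", "10.10.64.40")
print(len(frame), frame.hex())
```

Neighbours and routers that see this broadcast can refresh their ARP 
entry for the alias, which would explain why a manual arping makes the 
address reachable again.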



This is other issue. em(4) performs two time-consuming operations
in its initialization routine. One is DMA tag/map creation and the
other is checksumming EEPROM contents in init routine.
I have an experimental patch for it but let's fix one at a time.


OK, let's put that aside for now.

Regards,
Atanas


Re: em device hangs on ifconfig alias ...

2006-07-05 Thread Atanas


Pyun YongHyeon said the following on 6/30/06 8:54 PM:

On Fri, Jun 30, 2006 at 12:28:49PM -0700, Atanas wrote:
  User Freebsd said the following on 6/29/06 9:29 PM:
  
  The other funny thing about the current em driver is that if you move an 
  IP to it from a different server, the appropriate ARP packets aren't 
  sent out to redirect the IP traffic .. recently, someone pointed me to 
  arping, which has solved my problem *external* to the driver ...

  
  That's the second reason why I (still) avoid em in mass-aliased systems.
  
  I have a single pool of IP addresses shared by many servers with 
  multiple aliases each. When someone leaves and frees an IP, it gets 
  reused and brought up on a different server. In case it was previously 
  handled by em, the traffic doesn't get redirected to the new server.
  
  Similar thing happens even with machines with single static IPs. For 
  instance when retiring an old production system, I usually request a new 
  box to be brought up on a different IP, make a fresh install on 
  everything and test, swap IP addresses and reboot. In case of em, after 
  a soft reboot both systems are inaccessible.
  
  A workaround is to power both of the systems down and then power them 
  up. This however cannot be done remotely and in case there were IP 
  aliases, they still don't get any traffic.
  


I haven't fully tested it but what about attached patch?
It may fix your ARP issue. The patch also fixes other issues
related with ioctls.
Now em(4) will send a ARP packet when its IP address is changed even
if there is no active link. Since em(4) is not mii-aware driver I
can't sure this behaviour is correct.

The patch is against if_em.c,v 1.116 2006/06/06, which is 7-CURRENT. I 
tried merging the relevant em driver files into a 6-STABLE 
installation by simply copying sys/dev/em/* and sys/modules/em/Makefile, 
but it seems that the new revision depends on other -CURRENT things and 
the module build fails:


# pwd
/usr/src/sys/modules/em
# make clean; make
...
/usr/src/sys/modules/em/../../dev/em/if_em.c: In function 
`em_setup_interface':
/usr/src/sys/modules/em/../../dev/em/if_em.c:2143: error: 
`IFCAP_VLAN_HWCSUM' undeclared (first use in this function)

...

I don't have a 7-CURRENT based box around. It seems too bleeding edge 
for me anyway. I was hoping to play with different if_em kernel modules 
on a semi-production (spare) box and eventually test the proposed em 
patch, but apparently it's not so easy.


Please let me know if I'm missing something obvious.

Thanks,
Atanas






Index: if_em.c
===
RCS file: /pool/ncvs/src/sys/dev/em/if_em.c,v
retrieving revision 1.116
diff -u -r1.116 if_em.c
--- if_em.c 6 Jun 2006 08:03:49 -   1.116
+++ if_em.c 1 Jul 2006 03:51:41 -
@@ -692,7 +692,8 @@
 
 	EM_LOCK_ASSERT(sc);
 
-	if (!sc->link_active)
+	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+	    IFF_DRV_RUNNING)
 		return;
 
 	while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
@@ -751,11 +752,6 @@
 		return (error);
 
 	switch (command) {
-	case SIOCSIFADDR:
-	case SIOCGIFADDR:
-		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCxIFADDR (Get/Set Interface Addr)");
-		ether_ioctl(ifp, command, data);
-		break;
 	case SIOCSIFMTU:
 	    {
 		int max_frame_size;
@@ -802,17 +798,19 @@
 		IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFFLAGS (Set Interface Flags)");
 		EM_LOCK(sc);
 		if (ifp->if_flags & IFF_UP) {
-			if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+			if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+				if ((ifp->if_flags ^ sc->if_flags) &
+				    IFF_PROMISC) {
+					em_disable_promisc(sc);
+					em_set_promisc(sc);
+				}
+			} else
 				em_init_locked(sc);
-			}
-
-			em_disable_promisc(sc);
-			em_set_promisc(sc);
 		} else {
-			if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
+			if (ifp->if_drv_flags & IFF_DRV_RUNNING)
 				em_stop(sc);
-			}
 		}
+		sc->if_flags = ifp->if_flags;
 		EM_UNLOCK(sc);
 		break;
 	case SIOCADDMULTI:
@@ -878,8 +876,8 @@
 		break;
 	    }
 	default:
-		IOCTL_DEBUGOUT1("ioctl received: UNKNOWN (0x%x)", (int)command);
-		error = EINVAL;
+		error = ether_ioctl(ifp, command, data);
+		break;
 	}
 
 	return (error);

Index: if_em.h

Re: em device hangs on ifconfig alias ...

2006-06-30 Thread Atanas


Michael Vince said the following on 6/29/06 8:53 PM:


The thing that have to ask is if Atanas has 100's why can't he just boot 
Freebsd have have them all prebound to the interface at startup, why 
would you need to add and remove them constantly by the hundreds during 
normal server uptime?


I wasn't talking about the normal server uptime. Sooner or later, 
regardless of how perfect the hardware is and how great the OS performs, 
you will have to reboot. At least once or twice a year to update the 
kernel and/or world. Even in such rare occasions several minutes of 
additional downtime per reboot (in my case) are not justifiable.


I know that in a perfect world this downtime could be scheduled. But I 
prefer to keep the option to quickly reboot my systems when necessary. 2 
 vs 10-15 minutes downtime per reboot really makes a difference.


How you bind the aliases doesn't really matter - you always end up 
waiting for the em driver to reset the card on each alias.


And don't get me wrong, I have used em for years on many machines having 
one static IP or a few additional static aliases and it works great. It 
just doesn't fit well in mass-alias configurations.


Regards,
Atanas


Re: em device hangs on ifconfig alias ...

2006-06-30 Thread Atanas


User Freebsd said the following on 6/29/06 9:29 PM:


The other funny thing about the current em driver is that if you move an 
IP to it from a different server, the appropriate ARP packets aren't 
sent out to redirect the IP traffic .. recently, someone pointed me to 
arping, which has solved my problem *external* to the driver ...



That's the second reason why I (still) avoid em in mass-aliased systems.

I have a single pool of IP addresses shared by many servers with 
multiple aliases each. When someone leaves and frees an IP, it gets 
reused and brought up on a different server. In case it was previously 
handled by em, the traffic doesn't get redirected to the new server.


Similar thing happens even with machines with single static IPs. For 
instance when retiring an old production system, I usually request a new 
box to be brought up on a different IP, make a fresh install on 
everything and test, swap IP addresses and reboot. In case of em, after 
a soft reboot both systems are inaccessible.


A workaround is to power both of the systems down and then power them 
up. This however cannot be done remotely and in case there were IP 
aliases, they still don't get any traffic.


I have a third machine that uses an em driver, but its an older 4.x 
kernel, and it operates perfectly ... no timeouts/hangs and sends out 
the appropriate ARP packet ... all three servers are connected to the 
same Cisco switch, with all ports configured identically, so it isn't a 
switch issue, as someone else intimated ...



This seems strange, could depend on the chip version, who knows.

I still have many 4.x based machines, and both em issues (the card reset 
on each alias and the arp packets not being sent when going down) were 
present when I was doing my tests.


I check for these once in a while (a year or so), usually with the 
latest major release branch.


We had a compatibility issue about a year ago with a (rather exotic?) 
fiber NIC - 82545GM, where FreeBSD-4.x did better. The em driver coming 
with 5.x didn't support that (or wasn't working as expected, I don't 
remember the specifics), while the one coming with 4.x did, so we ended 
up installing 4.11 then.


Regards,
Atanas


Re: em device hangs on ifconfig alias ...

2006-06-30 Thread Atanas


User Freebsd said the following on 6/30/06 1:48 PM:


see 'arping' ... great little tool, solved all my problems as far as 
moving around IPs ...



Thanks for the tip, I will try it next time.

I still have many 4.x based machines, and both em issues (the card 
reset on each alias and the arp packets not been sent when going down) 
were present when I was doing my tests.


Right, what version of 4.x?  The one that I have working is from ~Feb 
2005 .. if I were to upgrade that to the latest 4-STABLE, it would break 
like the rest ... the older 4.x had a different em driver in the kernel 
then the newer one ...


The problem was initially discovered back in 2003 (must have been with 
4.8 or 4.9) and after switching back to fxp I haven't tested the 4.x 
branch any more.


I remember testing 5.x around the 5.3 release (2004) and 6.x shortly 
after 6.0 (2005), and both em driver versions shipped with these 
releases were having the same issues.


I haven't tested it this year yet, but even in case it's fixed, it's not 
likely that all improvements will get back-ported to the older branches 
(over 90% of my servers run something older than 6.x). But as long as 
fxp works, this is a non-issue for me.


Regards,
Atanas

Parallel fsck in non-preen/full mode?

2006-06-30 Thread Atanas

Is there some easy way to force a full (non-preen) and at the same time 
parallel (i.e. one process per disk) fsck?


It could be a real down time saver in crash recovery situations. Imagine 
the following (fairly typical in my case) scenario:


You have many machines with a bunch of drives each and many files on 
each drive. After a crash (due to a hardware failure or something else), 
the initial preen (fsck -p) fails. You have the following options:


a) rely on the background fsck available for 5.x and up;
b) set fsck_y_enable to YES to do fsck -y if the initial preen fails;
c) fsck it manually via local or serial console.

Background fsck relies on snapshots, which don't cope well with user 
quotas and often deadlock and cause more crashes. Actually the QUOTA + 
snapshots combination worked somewhat better in 5.x than it does in 6.x 
now. For 6.1 it's no longer an option for me.


An fsck -y is slow as hell as it doesn't run in parallel. For instance 
six 72GB drives (each about 75% full, with a million files) could take 
a good 2 hours, primarily because fsck assumes that interaction is 
required and runs the checks one at a time.


Manual fsck needs attention (additional down time), and the fastest way 
to bring the machine back up is to do exactly what fsck -p would do, 
but in _full_ mode, i.e.:


  # fsck -y da0s1a
  # fsck -y da0s1d &
  # fsck -y da1s1d &
  ...
  # fsck -y da7s1d &

  # ps ax |grep fsck
  # ...
  # exit

The above takes just 15 minutes or so, plus the time between the moment 
when the crash actually happens and the moment you start typing on the 
console (which sometimes could be much more than 15 minutes).


This could be automated by putting something similar (plus perhaps some 
shell code taking device entries from /etc/fstab and a loop waiting for 
the fsck processes to finish) in /etc/rc.early or a separate rc.d/ style 
script. But such a hack would, I think, look somewhat ugly in shell and 
would just mimic what fsck already does in order to check multiple 
drives when running in preen mode.
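
To make the idea concrete, here is a rough sketch of that automation, 
written in Python rather than shell purely for readability. It is only an 
illustration under assumptions: the function names and the sample fstab 
are invented, "one disk" is guessed from the device name prefix (da0, 
da1, ...), and unlike fsck -p it does not treat the root filesystem 
(pass 1) separately from the rest:

```python
import re
import subprocess
from collections import defaultdict

def fsck_groups(fstab_text):
    """Map each disk (e.g. 'da0') to its UFS filesystems with a non-zero pass number."""
    groups = defaultdict(list)
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 6 or fields[2] != "ufs" or fields[5] == "0":
            continue
        match = re.match(r"/dev/([a-z]+\d+)", fields[0])
        if match:
            groups[match.group(1)].append(fields[0])
    return dict(groups)

def fsck_parallel(groups):
    """Start one sequential 'fsck -y' chain per disk, wait for all, return the failure count."""
    script = 'rc=0; for dev in "$@"; do fsck -y "$dev" || rc=1; done; exit $rc'
    procs = [subprocess.Popen(["sh", "-c", script, "sh", *devs])
             for devs in groups.values()]
    return sum(proc.wait() != 0 for proc in procs)

# Hypothetical fstab contents, only to show the grouping; fsck is not run here.
SAMPLE_FSTAB = """\
/dev/da0s1b  none    swap  sw            0  0
/dev/da0s1a  /       ufs   rw            1  1
/dev/da0s1d  /var    ufs   rw            2  2
/dev/da1s1d  /home1  ufs   rw,userquota  2  2
"""
print(fsck_groups(SAMPLE_FSTAB))
```

Filesystems on the same disk are checked one after another so that they 
don't compete for the same spindle, while different disks proceed 
concurrently, which is the same scheduling preen mode uses.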


It seems that it would be really helpful (and possibly harmless) if fsck 
could be forced to do checks in parallel when running with '-y' when 
console interaction is not needed anyway, or perhaps through a new 
switch (-Y?).


I could try to eventually modify the fsck source and somehow change the 
default '-y' behavior. But I wouldn't like to carry such additional 
luggage of custom patches on all servers and also I don't think that I 
am the most qualified person to do so.


So in case someone still reads this, please advise.

Regards,
Atanas

Re: em device hangs on ifconfig alias ...

2006-06-28 Thread Atanas


Dan Nelson said the following on 6/28/06 3:52 PM:

In the last episode (Jun 28), User Freebsd said:

has anyone figured out why the em device 'hangs' for about 30-45
seconds whenever you ifconfig alias a new IP on to the device?


The em driver resets the card when you add an IP to it, and unless
you've configured your switch not to autodetect fancy features on that
port, it may very well take 45 seconds for it to come up.

For me the em reset actually takes about a second or so per single IP 
alias. But the more aliases you have, the longer the delay becomes. In 
case you have hundreds (like I do), a single reboot might cost you 
something like 10-15 minutes of downtime, just for the aliases to come up.
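
The arithmetic behind that estimate is simple enough to write down. In 
this small Python sketch the one-second reset cost is the figure 
mentioned above, while the alias counts are just example values, not 
measurements from any particular box:

```python
def alias_bringup_seconds(aliases: int, reset_seconds: float) -> float:
    """Total time spent in card resets if every alias costs one reset."""
    return aliases * reset_seconds

# Example alias counts (assumed), at roughly one second per reset
for count in (10, 100, 600, 900):
    total = alias_bringup_seconds(count, 1.0)
    print(f"{count:4d} aliases -> {total / 60:5.1f} min")
```

At about a second per alias, 600 to 900 aliases account for the 10-15 
minutes quoted.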


That's the primary reason I stay away from the on-board 1Gbps em NICs 
that almost every Intel server board nowadays comes with. I simply 
disable them and use a good old (and cheap) Intel PRO/100 fxp compatible 
PCI NIC instead. It's fast enough and doesn't reset the card when you 
add an alias. The only downside is that it gives you 100Mbps at most.


Does anybody know a better NIC driver alternative when dealing with lots 
of IP aliases?


I have some newer machines with 2 Broadcom chips on-board. I plan to 
give them a try at some point in the future, but I'm not sure how stable 
the bge driver is when compared to fxp and em.


Regards,
Atanas


Re: fsck_ufs locked in snaplk

2006-04-25 Thread Atanas


Kris Kennaway said the following on 4/25/06 9:22 AM:

On Tue, Apr 25, 2006 at 06:39:09PM +0300, Kostik Belousov wrote:


Obviously, revisions 1.78, 1.79 of the sys/ufs/ufs/ufs_quota.c
shall be MFCed. Try this patch (note, I does not tested it):


WTF, I could have sworn I merged that!  Yes, this patch is needed. 
However, I don't think it's the cause of runtime deadlocks.



Thanks for the heads up!
I was just about to release the next production box without checking
that and assuming the QUOTA fix was already in place.

I would like to confirm that I have another fully loaded server running
6.1-PRERELEASE (BETA2 based on 6-STABLE) from Mar 1 with manually
patched sys/ufs/ufs/ufs_quota.c (1.74.2.1 2006/01/14) with a similar
diff generated from CURRENT between 1.77 and 1.80.

55 days uptime and no problems so far.

Regards,
Atanas

P.S. Forgot to CC the list, sorry for the double post.


Index: sys/ufs/ufs/ufs_quota.c
===
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_quota.c,v
retrieving revision 1.77
retrieving revision 1.79
diff -u -r1.77 -r1.79
--- sys/ufs/ufs/ufs_quota.c 9 Jan 2006 20:42:19 -   1.77
+++ sys/ufs/ufs/ufs_quota.c 12 Feb 2006 13:20:06 -  1.79
@@ -429,8 +429,9 @@
 	quotaoff(td, mp, type);
 	ump->um_qflags[type] |= QTF_OPENING;
 	mp->mnt_flag |= MNT_QUOTA;
-	ASSERT_VOP_LOCKED(vp, "quotaon");
+	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
 	vp->v_vflag |= VV_SYSTEM;
+	VOP_UNLOCK(vp, 0, td);
 	*vpp = vp;
 	/*
 	 * Save the credential of the process that turned on quotas.
@@ -535,8 +536,9 @@
 	}
 	MNT_IUNLOCK(mp);
 	dqflush(qvp);
-	ASSERT_VOP_LOCKED(qvp, "quotaoff");
+	vn_lock(qvp, LK_EXCLUSIVE | LK_RETRY, td);
 	qvp->v_vflag &= ~VV_SYSTEM;
+	VOP_UNLOCK(qvp, 0, td);
 	error = vn_close(qvp, FREAD|FWRITE, td->td_ucred, td);
 	ump->um_quotas[type] = NULLVP;
 	crfree(ump->um_cred[type]);









Re: FreeBSD/i386 6-stable + 4 GB RAM

2006-03-14 Thread Atanas


Oliver Fromme said the following on 03/14/06 08:30:


Will FreeBSD/i386 6-stable run on a 4 GB machine out of the
box?  Do I have to apply special tuning (kernel config or
sysctl or whatever)?  Using PAE shouldn't be necessary, I
assume.

It all depends on how much memory address space the motherboard 
manufacturer decided to reserve for PCI devices. I've seen boards with a 
PCI window size ranging from 256 to 1024MB. In order to utilize the full 
amount of RAM you would need PAE.
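
As a rough illustration of what that means for a 4GB box, assuming the 
whole PCI window is carved out of the address space below 4GB and that a 
non-PAE i386 kernel cannot reach anything the BIOS remaps above it:

```python
MB = 1024 * 1024
installed = 4096 * MB          # 4 GB of RAM

# PCI window sizes mentioned above: 256 MB, 512 MB and 1024 MB
for window_mb in (256, 512, 1024):
    usable = installed - window_mb * MB   # RAM still addressable below 4 GB
    print(f"PCI window {window_mb:4d} MB -> about {usable // MB} MB usable without PAE")
```

These are upper bounds only; the kernel reports somewhat less once its 
own reservations are subtracted.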



It would also be interesting to know if 4-stable (which is
currently running on the predecessor machines) would run
without problems on those new 4 GB ones, too.

For 4-STABLE you would need to adjust KVA_PAGES. Otherwise, depending on 
the load and the memory usage, you might get random crashes. Or at least 
this is what I experienced when upgrading RAM on a bunch of 4.x based 
machines.


Regards,
Atanas

Re: well-supported SATA RAID card?

2006-03-10 Thread Atanas


Brian Szymanski said the following on 03/10/06 01:01:

Howdy...

After not having much success with the hptmv driver for highpoint's
rocketraid 1820A, I'm wondering if other folks have had good luck with any
SATA RAID cards with at least 6 ports... Is there a SATA RAID card with
utilities that let you manage while the OS is running that folks have had
good luck with? I've been happy with the megaraid series on linux at my
job, but I'm wondering if the management utilities are there on freebsd,
etc.


3ware.

Regards,
Atanas

hw.realmem on i386

2006-02-23 Thread Atanas

I'm setting up 6.1-BETA2/i386 on an AMD-based (dual Opteron 270) Tyan K8SE 
S2892 motherboard with 4GB RAM. The PCI memory address range on this 
board takes an entire gigabyte, leaving only 3GB of usable memory in i386 
mode. The remaining part gets remapped (by the BIOS) above the 4GB limit.


On the Intel server boards the PCI range used to take only 256MB or 
512MB, so I could afford ignoring that. But 1GB now seems too much and I 
decided to compile a PAE enabled kernel and see what happens.


The PAE enabled kernel detects the full amount of RAM, boots normally 
and seems all right so far (not in production yet):


  # dmesg |grep memory
  real memory  = 5368709120 (5120 MB)
  avail memory = 4182597632 (3988 MB)

  # memcontrol list
  ...
  0x0/0x8000 BIOS write-back set-by-firmware active
  0x8000/0x4000 BIOS write-back set-by-firmware active
  0x1/0x4000 BIOS write-back set-by-firmware active

The thing that puzzles me is the sysctl hw.realmem value:

  # sysctl -a |grep hw.*mem:
  hw.physmem: 4286291968
  hw.usermem: 4106076160
  hw.realmem: 1073741824

Wasn't this supposed to be greater than both hw.physmem and hw.usermem?

Or at least this is what I see on all other (non-PAE) boxes I have:

  realmem > physmem > usermem

Here are a few examples:

Intel SE7501WV2, 4GB (-256MB), non-PAE i386:
  hw.physmem: 4017508352
  hw.usermem: 3792785408
  hw.realmem: 4026466304

Intel SE7520JR2, 4GB (-512MB), non-PAE i386:
  hw.physmem: 3749007360
  hw.usermem: 3285360640
  hw.realmem: 3757965312

AMD-based, 4GB, amd64:
  hw.physmem: 4218327040
  hw.usermem: 399648
  hw.realmem: 4227792896

I'm wondering what impact such a supposedly incorrect hw.realmem < 
hw.physmem value could have, and whether the kernel options would need 
to be tweaked manually in order to fix that.
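
One observation about the numbers themselves, shown as a few lines of 
Python: the odd hw.realmem value is exactly what is left of the dmesg 
"real memory" figure after it is cut down to 32 bits. This is only 
arithmetic on the values printed above; I have not checked the sysctl 
code, so whether a 32-bit truncation is really the cause is an assumption:

```python
real_memory = 5368709120   # "real memory" line from dmesg on the PAE box
hw_realmem = 1073741824    # hw.realmem reported by sysctl on the same box

truncated = real_memory % 2**32   # what survives in a 32-bit quantity
print(truncated, truncated == hw_realmem)
# prints: 1073741824 True
```

If that is what happens, the value would be a reporting artifact rather 
than a sign that the kernel is using only 1GB.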


I remember a case when I had to upgrade a 4.x based box from 2GB to 4GB, 
and vm.kvm_free became larger than vm.kvm_size, resulting in random 
crashes (until I realized that I had to manually adjust the KVA_PAGES 
kernel option), but that does not seem to be very relevant here.


I'm running 6.1-PRERELEASE (6-STABLE) from Feb 22 2006 with the 
following mods to the kernel configuration files:


GENERIC:
 cpu   I486_CPU
 options   INET6   # IPv6 communications protocols
 options   SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
 options   QUOTA
 options   SMP # Symmetric MultiProcessor Kernel

PAE:
 options   IPFIREWALL
 options   DUMMYNET

Regards,
Atanas

Re: AMD64 or I386

2006-02-22 Thread Atanas


Albert Shih said the following on 02/22/06 16:59:

Hi all

I've very strange problem with my new servers with AMD single core dual
proc with 4 Go Ram.

When I boot the i386 version of FreeBSD 6.0 he see 4 Go but tell me he
can't not access to 4go but only 3 Go. I upgrade to FreeBSD 6-Stable and 
nothing change.

When I boot the amd64 version of FreeBSD 6.0 he see 5 Go (!!) but tell me
he can access only 4 go (well). When I upgrade to FreeBSD 6-Stable nothing
change.

My problem is I need i386 version (because I'need maxima who need
sbcl...and sbcl don't run on amd64).

If I tell the kernel I've 4 Go the system don't boot (kernel panic).

What can I do.

You might need to compile a PAE enabled kernel, see the pae(4) man page 
for more details.


Regards,
Atanas


Re: SSH login takes very long time...sometimes

2006-02-17 Thread Atanas


Carl Makin said the following on 02/16/06 20:07:

Atanas wrote:
Does anybody know whether ipfw (or something else within FreeBSD-4) is 
capable of setting connection rate limits?


I'm using SEC to monitor the auth.log file and block any IP addresses 
that fail a password 3 times within 60 seconds.  I use the following 
sec.conf file;



Yeah, it does pretty much the same thing I do with a simple script like:

#!/usr/bin/perl
use strict;

my $MAX_TRIES = 5;
my $RULE_BASE = 10100;
my $RULES_MAX = 10;
my $Rule = $RULE_BASE;
my %Match;

# Block $ip on $port, reusing rule numbers in a ring of $RULES_MAX slots.
sub ip_block  # ($ip, $port)
{   my ($ip, $port) = @_;

    `ipfw delete $Rule` if `ipfw list $Rule 2>/dev/null`;
    `ipfw add $Rule deny tcp from $ip to any $port in setup`;

    $Rule = $RULE_BASE + ($Rule + 1 - $RULE_BASE) % $RULES_MAX;
}

open LOG, "tail -f /var/log/auth.log |" or die "cannot tail auth.log: $!";
while (<LOG>) {
    if( /sshd\[\d+\]/ ) {
        if( /((Illegal user|Failed password for) \S+|Did not receive identification string) from (\d+\.\d+\.\d+\.\d+)/ ) {
            my $ip = $3;
            # Let the first $MAX_TRIES failures from each address pass.
            next if $Match{$ip}++ < $MAX_TRIES;
            ip_block($ip,22);
            delete $Match{$ip};
        }
    }
}
close LOG;

And a cron job removes the blocks every hour:

7 * * * * /sbin/ipfw delete 10100 10101 10102 10103 10104 10105 10106 10107 10108 10109
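
The bookkeeping is easier to follow in isolation. The following Python 
model covers only the counting and the rule-slot rotation and runs no 
ipfw commands; the constants and the pattern are taken from the script 
above, while the function names and any log lines fed to it are invented 
for the illustration:

```python
import re

RULE_BASE, RULES_MAX, MAX_TRIES = 10100, 10, 5
PATTERN = re.compile(
    r"((Illegal user|Failed password for) \S+|"
    r"Did not receive identification string) from (\d+\.\d+\.\d+\.\d+)"
)

def next_rule(rule: int) -> int:
    """Advance to the next ipfw rule slot, wrapping after RULES_MAX slots."""
    return RULE_BASE + (rule + 1 - RULE_BASE) % RULES_MAX

def offenders(lines):
    """Yield (ip, rule) whenever an address exceeds MAX_TRIES matching sshd lines."""
    counts, rule = {}, RULE_BASE
    for line in lines:
        match = PATTERN.search(line) if "sshd[" in line else None
        if not match:
            continue
        ip = match.group(3)
        seen = counts.get(ip, 0)      # failures recorded before this line
        counts[ip] = seen + 1
        if seen < MAX_TRIES:
            continue
        yield ip, rule                # here the script would add the deny rule
        rule = next_rule(rule)
        del counts[ip]
```

With ten slots, the eleventh blocked address reuses rule 10100 and so 
replaces the oldest block, which together with the hourly cron cleanup 
keeps the rule set from growing.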


It does the job, but it would be nice for sshd to have some rate-limit 
protection built-in. Otherwise, with the increasing number of attacks 
nowadays, many people would need similar protection.


Regards,
Atanas


Re: SSH login takes very long time...sometimes

2006-02-17 Thread Atanas


Marian Hettwer said the following on 02/17/06 00:39:

Atanas wrote:

Last year I already had to decrease the LoginGraceTime from 120 to 30
seconds on my production boxes, but it didn't help much, so on top of
that I had to implement (reinventing the wheel again) a script tailing the
auth.log and firewalling bad guys in order to secure sshd and let my
legitimate users in.


You could get rid of parsing auth.log and everything and just use pf(4)
instead.

Look at that:
# sshspammer table
table <sshspammer> persist
block log quick from <sshspammer>

# sshspammer
# more than 6 ssh attempts in 15 seconds will be blocked ;)
pass in quick on $ext_if proto tcp to ($ext_if) port ssh $tcp_flags \
    (max-src-conn 10, max-src-conn-rate 6/15, overload <sshspammer> flush global)

Thanks for the suggestion! The pf in 5.x/6.x base and especially its 
rate-limit capability seems to be a good reason to upgrade my existing 
4.x based boxes before RELENG_4's EoL.


Regards,
Atanas

Re: SSH login takes very long time...sometimes

2006-02-17 Thread Atanas


Mike Tancsa said the following on 02/17/06 11:50:

At 09:17 PM 16/02/2006, Atanas wrote:

Does anybody know whether ipfw (or something else within FreeBSD-4) is 
capable of setting connection rate limits?


Why not just launch sshd out of inetd ?


Primarily because of the big scare sign in the sshd man page:

 -i   Specifies that sshd is being run from inetd(8).  sshd is normally
  not run from inetd because it needs to generate the server key
  before it can respond to the client, and this may take tens of
 ^^^
  seconds.  Clients would have to wait too long if the key was
  ^^^
  regenerated every time.  However, with small key sizes (e.g.,
  512) using sshd from inetd may be feasible.

It was my fault for not verifying how much time it really takes. I just 
tested it on a couple of machines, and it seems to be way faster:


  # time ssh [EMAIL PROTECTED]

  real    0m0.669s
  user    0m0.012s
  sys     0m0.000s

  # time ssh [EMAIL PROTECTED]

  real    0m0.374s
  user    0m0.000s
  sys     0m0.008s

  # time ssh [EMAIL PROTECTED]

  real    0m0.348s
  user    0m0.000s
  sys     0m0.008s


I ran this multiple times. The first one defaults to 2048-bit key (a 
6-STABLE based box), the second one - to 1048 bit (5.4), the third one 
to a standalone ssh daemon.


So what the man page says about the timings could have been true some 10 
years ago, but not now.



Start up inetd with -wWl -C 5

In inetd.conf
ssh stream  tcp nowait  root  /usr/sbin/sshd /usr/sbin/sshd -i

This will allow 5 connections per min from a single IP.

Yeah, I still use it to run (pro)ftpd, and never had problems with that. 
It's also possible to specify per-entry limits, like:


ftp  stream  tcp  nowait/100/60/10  root  /usr/libexec/ftpd  ftpd -l
ssh  stream  tcp  nowait/50/10/5    root  /usr/sbin/sshd     sshd -i

50/10/5 = max-children/max-conn-per-ip-per-minute/max-child-per-ip
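
To make the field layout concrete, here is a small Python sketch that splits such a wait field into its three optional limits. It is not part of inetd; the function name parse_wait_field is made up for illustration.

```python
def parse_wait_field(field):
    """Split an inetd.conf wait field such as 'nowait/50/10/5'.

    Returns (mode, max_child, max_conn_per_ip_per_minute, max_child_per_ip).
    Limits that are not given come back as None.
    """
    parts = field.split("/")
    mode = parts[0]
    if mode not in ("wait", "nowait"):
        raise ValueError("unknown mode: %r" % mode)
    if len(parts) > 4:
        raise ValueError("too many limits in %r" % field)
    limits = [int(p) for p in parts[1:]]
    # Pad the missing trailing limits with None.
    limits += [None] * (3 - len(limits))
    return (mode, limits[0], limits[1], limits[2])
```

Each trailing limit is optional, so a bare "nowait" or "nowait/100" parses as well.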

Regards,
Atanas

Re: SSH login takes very long time...sometimes

2006-02-16 Thread Atanas


Dag-Erling Smørgrav said the following on 02/15/06 23:35:

David Malone [EMAIL PROTECTED] writes:

I did once mail des@ to ask him if he'd mind me changing the default
login timeout for sshd to be (say) 5 minutes rather than 1 minute,
but I think he was busy at the time. Judging by the PR mentioned
above it should be at least 2m30s by default. Des, would you mind
this change being made?


No objection, just let me see the patch first.

DES


Just a thought, wouldn't this open a new possibility for denial of 
service attacks?


Last year I already had to decrease the LoginGraceTime from 120 to 30 
seconds on my production boxes, but it didn't help much, so on top of 
that I had to implement (reinventing the wheel again) a script tailing 
auth.log and firewalling the bad guys in order to secure sshd and let my 
legitimate users in.


I really miss the inetd features. A setting like nowait/100/20/5 
(/max-child[/max-connections-per-ip-per-minute[/max-child-per-ip]]) 
would effectively bounce the bad guys, but AFAIK (correct me if I'm 
wrong), ssh is no longer supposed to work via inetd and still has no 
such capabilities.


It'd be nice to have something like, for instance, sendmail's per-client 
and connection rate limits, but I guess this is not the right place to ask.


Regards,
Atanas

Re: SSH login takes very long time...sometimes

2006-02-16 Thread Atanas


David Malone said the following on 02/16/06 13:24:
Just a thought, wouldn't this open a new possibility for denial of 
service attacks?


I doubt it. I'm guessing you're thinking of an attack where someone
makes many connections to sshd in a short time and runs you out of
processes? I think you can protect against this with the MaxStartups
directive in sshd_config. The amount of time that an attacker has
to open many connections is probably not that important, as you can
open a lot of TCP connections in 1 second even with a small link.

These were different types of attacks, primarily originating from single 
IP addresses:


1. Dictionary attacks taking as many concurrent unauthenticated 
connections as possible, at speeds as fast as the server can 
respond. These happened anywhere from a few to several times a day, 
sometimes lasting hours.


2. Time based attacks taking again all of the available MaxStartups, but 
then doing nothing until the LoginGraceTime expires, then again, etc. 
These were not so frequent, but had the worst impact on the ssh 
availability.


3. Network scans on machines hosting some hundreds (or in some cases 
thousands) of IP addresses, causing outages lasting just a few minutes 
or so.


Last year I already had to decrease the LoginGraceTime from 120 to 30 
seconds on my production boxes, but it didn't help much, so on top of 
that I had to implement (reinventing the wheel again) a script tailing 
auth.log and firewalling the bad guys in order to secure sshd and let my 
legitimate users in.


Are you trying to prevent the ssh scanners that just try well-known
combinations of usernames and passwords? It is not clear that you
gain much by firewalling these off, other than having fewer log
messages.

All three of the above. It wasn't just a matter of too many log 
messages. Type 1, for instance, besides making ssh unavailable, and 
depending on the MaxStartups setting, can bring a server to its knees by 
dedicating all of its available resources to bouncing unauthenticated 
ssh requests.


I tried setting a 'limit' ipfw firewall rule, something like:

  ipfw add allow tcp from any to any 22 in setup limit src-addr 5

I already had success with such a rule for a first level SMTP DoS 
protection before sendmail got its per-client and rate connection limits 
built in (since 8.13.0), and still keep that on, just in case.


But unlike sendmail, ssh instances when hammered with bogus requests are 
way more CPU intensive. I couldn't afford limiting the ssh connectivity 
to just one single session per client IP (someone might need multiple 
ssh sessions while working on something, right?), and in case of 
multiple sessions enabled, machines would be still vulnerable to CPU 
overload (i.e. bouncing tons of useless ssh authentication attempts).


So the best option for me was to implement a log analyzer script placing 
temporary blocks on the firewall when necessary. After 5 "Illegal 
user", "Failed password for" or "Did not receive identification 
string" events, the script simply denies that IP on the firewall right 
away for one hour. So far this works well (for about 6 months 
already) and I no longer see unusual load spikes or ssh connectivity 
outages like before.
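
The counting part of such a log analyzer could look roughly like the Python sketch below. This is not the actual script. The regular expression, the function names and the exact ipfw command line are assumptions made for illustration, and the one-hour expiry of the blocks is left out.

```python
import re
from collections import Counter

# Substrings in auth.log that count as a bad event.
PATTERNS = ("Illegal user", "Failed password for",
            "Did not receive identification string")
# Assumes the client address appears as "from a.b.c.d" in the log line.
IP_RE = re.compile(r"from (\d{1,3}(?:\.\d{1,3}){3})")
THRESHOLD = 5


def offenders(lines, threshold=THRESHOLD):
    """Return the sorted list of IPs with at least `threshold` bad events."""
    counts = Counter()
    for line in lines:
        if any(p in line for p in PATTERNS):
            match = IP_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


def block_commands(ips):
    """Build one (hypothetical) ipfw deny command per offending IP."""
    return ["ipfw add deny tcp from %s to any 22" % ip for ip in ips]
```

A real script would also have to remember when each block was added so it can remove the rule again after an hour.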


I really miss the inetd features. A setting like nowait/100/20/5 
(/max-child[/max-connections-per-ip-per-minute[/max-child-per-ip]]) 
would effectively bounce the bad guys, but AFAIK (correct me if I'm 
wrong), ssh is no longer supposed to work via inetd and still has no 
such capabilities.


You can still run sshd through inetd (or, at least, the -i option
is still documented in the sshd man page). It does suggest that you
may need to reduce the key size to make this practical (increasing
LoginGraceTime here may help too ;-)

I knew that, but never actually tried it, thinking it would be too slow. 
Now I just ran ssh-keygen and found that it takes only a few seconds 
for a 1024-bit key and several for a 2048-bit key. So it's not that bad, 
and running it with a 512 or 1024-bit key through inetd seems feasible enough.


The default ssh key length in FreeBSD-6 however just got doubled from 
1024 to 2048-bit. I believe there's a reason for that and don't like the 
idea of going down.


Regards,
Atanas

Re: SSH login takes very long time...sometimes

2006-02-16 Thread Atanas


Niki Denev said the following on 02/16/06 16:11:


I solved this for me with the following pf(4) rule :

pass in quick on $ext inet proto tcp from any to any port ssh flags S/SA \
  keep state (source-track rule, max-src-conn $max_conn_per_ip, \
  max-src-conn-rate $max_conn_rate, \
  overload <tempban-ssh> flush global)

with appropriate $max_conn_per_ip and $max_conn_rate limits,
and expiretable in a cronjob to flush all entries in the tempban-ssh table
which are older than a predefined period.

I hope this helps.

Thanks for the tip! I knew that at some point I will have to switch to 
pf, but unfortunately it wasn't available in FreeBSD-4.x, and I still 
have plenty of such boxes.


Does anybody know whether ipfw (or something else within FreeBSD-4) is 
capable of setting connection rate limits?


Regards,
Atanas

Re: SSH login takes very long time...sometimes

2006-02-16 Thread Atanas


[EMAIL PROTECTED] said the following on 02/16/06 14:49:

Hello,
You should try Xinetd as it has more options to help with this. I believe
your SSH problem is due to a DNS/RDNS problem.

No, it wasn't a DNS issue. (x)inetd would help, but in such a case sshd 
would need to generate a server key (takes seconds and CPU) on every 
incoming ssh connection, which would be kind of slow and wasteful.


Regards,
Atanas

Re: 6.0-RELEASE freeze repetitively

2006-01-30 Thread Atanas


Henri Hennebert said the following on 01/07/06 03:52:

Hello,

I have a production server which freezes from time to time
(from 5 hours to 7 days between freezes).

I tried (in /boot/loader.conf.local):

  - hint.acpi.0.disabled=1

  - debug.mpsafevfs=0
debug.mpsafevm=0

  to no avail.

If you have SMP and QUOTA enabled, you might find the following threads 
useful:


http://lists.freebsd.org/pipermail/freebsd-hackers/2005-November/014339.html
http://lists.freebsd.org/pipermail/freebsd-stable/2005-December/020606.html
http://lists.freebsd.org/pipermail/freebsd-stable/2006-January/021693.html

All of the possible workarounds I know are limited to the following:
- disable SMP
- disable QUOTA support
- downgrade to 5.4(-STABLE)

This should have been in errata since mid Nov 2005.

Regards,
Atanas



I setup a serial console and go into ddb and get:

KDB: enter: Line break on console
[thread pid 61 tid 100052 ]
Stopped at  kdb_enter+0x30: leave
db>
db> bt
Tracing pid 61 tid 100052 td 0xc355e300
kdb_enter(c075ddb2,a,9e0a90,417ff9,e69e0aac) at kdb_enter+0x30
siointr1(c3660800,a,2,2,6400) at siointr1+0xe7
siointr(c3660800,257,0,4,c355e300) at siointr+0x78
intr_execute_handlers(c352b090,e69e0aa0,e69e0af8,c06fee23,34) at intr_execute_handlers+0x98
lapic_handle_intr(34) at lapic_handle_intr+0x3a
Xapic_isr1() at Xapic_isr1+0x33
--- interrupt, eip = 0xc057d1c0, esp = 0xe69e0ae4, ebp = 0xe69e0af8 ---
vop_stdlock(c078e260,e69e0b50,e69e0b20,c0721424,e69e0b50) at vop_stdlock
ffs_lock(e69e0b50,0,2012,c54db440,e69e0b6c) at ffs_lock+0x19
VOP_LOCK_APV(c078db60,e69e0b50,c3811400,e69e0b4c,c057d232) at VOP_LOCK_APV+0x54
vn_lock(c54db440,2012,c355e300,c079a980,c6029000) at vn_lock+0x13e
vget(c54db440,2012,c355e300,c81c1aa0,1) at vget+0xff
qsync(c3811400,c3811400,c355e300,c3b64660,0) at qsync+0x1df
ffs_sync(c3811400,3,c355e300,c355e300,c355e300) at ffs_sync+0x3fb
sync_fsync(e69e0ca0,e69e0cbc,c05876a4,c0782700,e69e0ca0) at sync_fsync+0x21b
VOP_FSYNC_APV(c0782700,e69e0ca0,c355e300,0,e69e0cbc) at VOP_FSYNC_APV+0x3e
sync_vnode(c39093f0,c355e300,68,c07493b5,0) at sync_vnode+0x1b4
sched_sync(0,e69e0d38,4489e045,458d1424,244489dc) at sched_sync+0x2ef
fork_exit(c05877a0,0,e69e0d38) at fork_exit+0x80
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe69e0d6c, ebp = 0 ---
db> ps
  pid   proc     uid  ppid  pgrp  flag   stat  wmesg    wchan  cmd
94717 c3e64624   80 94700 94696 0004000 [RUNQ] rrdtool
94703 c3c08624    0   800   800 101 [RUNQ] rsync
94702 c3e6c000   80 94698 94698 0004000 [RUNQ] perl
94701 c3e6420c  113 94697 94697 0004001 [RUNQ] perl5.8.7
94700 c5f41a3c   80 94696 94696 0004000 [SLPQ wait 0xc5f41a3c][SLP] perl5.8.7
94698 c35fb418   80 94693 94698 0004000 [SLPQ wait 0xc35fb418][SLP] sh
94697 c3e6c20c  113 94694 94697 0004000 [SLPQ wait 0xc3e6c20c][SLP] sh
94696 c382a20c   80 94692 94696 0004000 [SLPQ wait 0xc382a20c][SLP] sh
94694 c35fb000    0   710   710 000 [SLPQ piperd 0xc4adc660][SLP] cron
94693 c8160c48    0   710   710 000 [SLPQ piperd 0xc815b4c8][SLP] cron
94692 c5f40624    0   710   710 000 [SLPQ piperd 0xc38a9198][SLP] cron
94689 c815f000    0     1 94689 100 [SLPQ select 0xc07a7404][SLP] sendmail
94584 c3e6ca3c 60667   678   678 100 [SLPQ lockf 0xc5c2ea80][SLP] perl5.8.7
94338 c3c08a3c 60667   678   678 100 [SLPQ lockf 0xc5c636c0][SLP] perl5.8.7
94300 c43ab830    8   758   739 0004000 [SLPQ nanslp 0xc07a086c][SLP] sleep
93950 c389f624 60667   678   678 100 [SLPQ lockf 0xc5c2e380][SLP] perl5.8.7
93326 c43aaa3c 60667   678   678 100 [SLPQ select 0xc07a7404][SLP] perl5.8.7
93248 c3c08830    0   694   694 100 [SLPQ select 0xc07a7404][SLP] sendmail
92514 c43ab624    8   754   754 0004000 [SLPQ sbwait 0xc4a43a64][SLP] perl5.8.7
92513 c3c08418    8   754   754 0004000 [SLPQ select 0xc07a7404][SLP] innfeed
88624 c816020c    0     1 88624 0004002 [SLPQ ttyin 0xc3666010][SLP] getty
88573 c3827418    0     1 88573 0004002 [SLPQ ttyin 0xc3666410][SLP] getty
88570 c43a820c    0     1 88570 0004002 [SLPQ ttyin 0xc3665410][SLP] getty
 3820 c39ca624    0     1  3820 0004002 [SLPQ ttyin 0xc3664c10][SLP] getty
  905 c43ab20c    0     1   905 000 [SLPQ select 0xc07a7404][SLP] inetd
  896 c3e64c48    0     1   896 000 [SLPQ select 0xc07a7404][SLP] moused
  872 c389c000    0 168 000c082 (threaded)  java
   thread 0xc60c9600 ksegrp 0xc389e1e0 [SLPQ kserel 0xc389e214][SLP]
   thread 0xc4f56180 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc5e2ae90][SLP]
   thread 0xc49eda80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc6191a64][SLP]
   thread 0xc4f61180 ksegrp 0xc389e1e0 [SLPQ kserel 0xc389e214][SLP]
   thread 0xc49fb900 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc3a37a64][SLP]
   thread 0xc4d13a80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc60d4e90][SLP]
   thread 0xc5f5c900 ksegrp 0xc389e1e0 [SLPQ accept 0xc3d9ab5a][SLP]
   thread 0xc4f22d80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc5eba20c][SLP]
   thread 0xc622ad80 ksegrp 0xc389e1e0 [SLPQ sbwait 0xc61914d4][SLP]
   thread 0xc3e65300

Re: diskio / filesystem related deadlock on SMP 6.0-STABLE machine.

2006-01-27 Thread Atanas



Kris Kennaway said the following on 1/26/2006 3:46 PM:

On Thu, Jan 26, 2006 at 06:44:22PM -0500, Kris Kennaway wrote:

On Thu, Jan 26, 2006 at 06:37:16PM -0500, Mike Jakubik wrote:

Kris Kennaway wrote:

On Thu, Jan 26, 2006 at 05:07:56PM +0200, Niki Denev wrote:
 

On Thursday 26 January 2006 10:40, Niki Denev wrote:
   
[...]


After i disabled option QUOTA in both my default kernel config
and the one i compiled with the debugging options i was unable
to reproduce the deadlock again. (i hope it stays that way :) )
This, together with the report in my previous post probably point
that the problem is in the QUOTA support.
   

Actually, I think this is known.

Kris
 
Well thats good to know, i was planning on upgrading a production box 
from 5 to 6, its SMP and uses QUOTA. How did 6 get released when QUOTA 
was known to cause deadlocks?


FYI, you can probably work around this by setting debug.mpsafevfs=0.
Of course, you'll lose the filesystem performance benefits.

Kris


I'd like to confirm that setting debug.mpsafevfs=0 (along with 
debug.mpsafevm=0 and debug.mpsafenet=0) doesn't help either.


I ran into the same problem about a month ago on 2 production boxes 
running 6-STABLE (SMP with QUOTA enabled) and the only way I found to 
stop them crashing was switching back to 5.4.


I hope this will get fixed in 6.1.

Regards,
Atanas

Re: 6.0-STABLE can't see floppy drive for KLD loading

2005-12-30 Thread Atanas


Kim Culhan said the following on 12/29/05 14:10:

Trying to install 6.0-STABLE on a Supermicro 5014C-MT server with
a P8SCT motherboard.

Also installed is a 3Ware 9550SX sata raid controller.

The latest 6.0-STABLE snap from 12-08-05 does not have the version
of the TWA driver which supports the 9550SX so it would be necessary to
kldload the binary, available from the 3Ware web site.

Running sysinstall and navigating to Configure--Load KLD
the installation fails as it complains: Cant find the floopy

The hardware appears to be good as it is possible to boot a dos floppy.

Have not been successful finding a more recent snap.. the changes for the
9550SX controller were about a day late to be included in the 12-08 snap :(

Any info is greatly appreciated


You could do the following:
- attach a standalone drive (IDE,SATA,SCSI)
- install FreeBSD and upgrade it to *-STABLE
- partition a 3ware drive (array, whatever) and make it bootable
- transfer the OS installation from the standalone drive to the 3ware 
partition.


Or you could create a bootable USB (hard, flash, iPod, digital camera, 
etc.) drive up to date with -STABLE (see /usr/src/tools/tools/nanobsd), 
boot the 3ware box from it, and install over the network.


I haven't tried the latter myself, but I have already used a nanobsd based 
USB flash drive as emergency recovery media on a 9550SX based box, and it 
worked.


Regards,
Atanas

Re: 6.0 random freezes

2005-12-14 Thread Atanas


Peter Jeremy said the following on 12/13/05 02:00:


Note that PS/2 keyboards aren't hot-pluggable and attempts to do so
can have deleterious effects on your keyboard and/or motherboard.  In
any case, the probe/attach sequence relies on the kernel being in a
reasonably sane state (and I'm not sure if it will detect the keyboard
as a console device except at boot time).

I agree, but the keyboard is a passive device (with no power source, 
i.e. mostly harmless), and it's standard practice to have only a few 
movable consoles for several racks and plug them in only where 
necessary. It has always worked for us and I don't remember having 
any hot-plugging accidents for years.



If the keyboard has been plugged in since the system booted, do you
still get the same no response?  If so, the kernel has wedged at
a fairly low level and I'm not quite sure how to proceed other than
by enabling the sanity checks that other people have mentioned
(eg WITNESS, INVARIANTS) and hoping they catch something.

I cannot say for sure. When the thing happens I'm usually away, and 
by the time I get there, the console could have been used by someone. I'm 
in the process of getting a serial console, so if there's no response 
there either, I will enable the sanity checks.



I only mentioned serial consoles on the off-chance that you had one.
Whilst it may not help here, serial consoles have a number of
advantages when managing remote equipment


Thanks for pointing this out. As I said, I'm in the process of getting one 
for now, and possibly equipping some dozens of servers with that later.


After the downgrade we could eventually set a test bed and start 
hammering it with requests. The problem would be how to trigger the 
crash and whether we would be able to reproduce it at all.


I already went the 5.4 downgrade route. Actually I was forced to do so 
the other night, when one of the machines started hanging 
every half an hour or so. Looks like the background fsck on the slower 
SATA based RAID5 array helped a lot with that.


Now I have the test bed online. This is the very same server (SCSI 
based, with the OS drive intact and production data drives moved 
elsewhere) that was crashing once a day or so. Hopefully tomorrow I will 
have a serial console attached to it, so we can start pounding it. I 
hope this machine won't need to go in production during the next month 
or so and we'll have enough time for tests.


 Depending on your application and the interfaces to it, it might be
 feasible to either tee live traffic into both systems and just junk
 the responses from your test bed, or record live traffic and
 replay it into your test bed.

It runs a fairly complex set of services. It's a shared web hosting 
server handling some hundreds of websites, and also email 
SMTP/POP3/IMAP, databases MySQL, FTP, DNS, etc.


I don't know how easy it would be to implement such traffic capture and 
replay on the test bed. It seems kind of complicated at first 
sight (though I realize it might be the only way to reproduce the 
crash). We might need some NAT (via ipfw?), some services might not like 
their responses being junked, etc.


I was thinking about trying the kernel stress suite first. Or just have 
something rsync-ing lots of files back and forth (possibly over the 
network), run apache bench in a loop and point it to some database 
intensive page, etc.


Regards,
Atanas

Re: 6.0 random freezes

2005-12-13 Thread Atanas


Atanas said the following on 12/12/05 18:57:

Peter Jeremy said the following on 12/12/05 13:40:



 When it hangs, break into DDB (Ctrl-Alt-Esc on the console or BREAK on
 a serial console).

 But if I have no keyboard response I won't be able to save it, right?

(replying to myself)
This is exactly what I was afraid would happen. The SATA based box just 
hung up again, with all of the kernel debugging options in place:


  makeoptions   DEBUG=-g
  options   KDB
  options   DDB

But I wasn't able to do anything with the keyboard in order to save a 
crashdump, so I had no choice other than hitting the reset button.


Regards,
Atanas

6.0 random freezes

2005-12-12 Thread Atanas


Hi,

I have 3 machines running 6.0-RELEASE, and recently 2 of them started 
freezing once a day or so. There are no error messages on the console or 
in the system logs.


The first one I put in production about a month ago and it was working 
flawlessly until it got some load and now it started freezing almost 
every day. The second one has exactly the same behavior - it was fine 
when doing nothing (a couple of weeks), and started freezing when loaded.


The load I'm talking about is less than moderate (less than 2.0 with 
plenty of CPU idle time). The freezing also does not appear to 
happen at peak times (I have rrdtool based CPU load graphs).


Both machines have (almost) identical motherboards:

Intel SE7520JR2SCSID2 and SE7520JR2ATAD2
2 Intel XeonE 3.2GHz 800MHz CPUs
4GB DDRII400 RegECC RAM

The first one has 8 72GB SEAGATE ST373207LC 0003 Ultra320 SCSI drives 
attached as plain drives (no raid) to the on-board LSILogic 1030 Ultra4 
Adapter.


The second one has 8 500GB SEAGATE ST3500641AS SATA2 drives attached 
to a 3ware Model 9550SX-8LP controller and configured as a RAID5 array.


The motherboards have 2 1000Mbps NICs on board, but due to some (em) 
driver problems, I usually disable these from BIOS and use a PCI Intel 
100Mbps (fxp) instead.


Both machines were running 6.0-RELEASE, i386. For the last one I had to 
update the twa driver manually, as the one shipped with 6.0 didn't 
support the 3ware 9550SX. I see that a new version recently got committed 
into the -STABLE branches.


Here are the diffs against the GENERIC kernel configuration:

 cpu   I486_CPU
 cpu   I586_CPU

 makeoptions   DEBUG=-g        # Build kernel with gdb(1) debug symbols


 options   INET6   # IPv6 communications protocols
53d47
 options   SCSI_DELAY=5000 # Delay (in ms) before probing SCSI

 options   QUOTA
 options   SMP # Symmetric MultiProcessor Kernel

/boot/loader.conf:

kern.ipc.nmbclusters=65536

/etc/sysctl.conf:

kern.ipc.somaxconn=1024
net.inet.tcp.recvspace=16384
net.inet.ip.fw.verbose=1
machdep.hyperthreading_allowed=1

Both machines boot with ACPI and hyperthreading enabled.

First I suspected the hardware, so I replaced the entire box (keeping 
the same drives) - no changes - it got frozen again in less than 24 hours.


Then I disabled ACPI (hint.acpi.0.disabled=1) and the hyperthreading - 
no change - the same thing.


Then after reading all related (I believe) postings here and in 
freebsd-current, I decided to upgrade both boxes to 6.0-STABLE (I saw a 
lot of changes in the source tree), but the thing continued to happen.


I have another machine with the same hardware components (the SCSI based 
one), but running 5.4-RELEASE. Unlike these two, it's really loaded 
(even got DDoS-ed a while ago) and I had zero problems with it for months.


I remember having similar issues when performing 4GB RAM upgrades on a 
bunch of 4.x based boxes, when I had to set KVA_PAGES to something like 
512. For 5.3+, however, this no longer seems to be an issue.


I would provide more useful feedback if I had some real and relevant 
error messages. Actually I got some unusual errors on only one of the 
affected servers:


Dec 11 02:48:36 xyz kernel: calcru: runtime went backwards from 28636364 usec to 28636021 usec for pid 28588 (httpd)

But it does not seem to be very relevant to the problem, as it did not 
happen anywhere close to the freezes (i.e. it was 26 hours after the 
last crash and 19 hours before the next one).


Now the only reasonable option for me (I mean for production and in 
the relatively short term) seems to be going down to 5.4 and waiting 
until 6.x gets more stable.


Two dmesg.boot files attached.

Any comments, suggestions and questions are welcome.

Regards,
Atanas
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.0-STABLE #0: Fri Dec  9 14:54:05 PST 2005
[EMAIL PROTECTED]:/var/obj/usr/src/sys/XYZ
ACPI APIC Table: <A M I  OEMAPIC >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3192.01-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf43  Stepping = 3
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x641d<SSE3,RSVD2,MON,DS_CPL,CNTX-ID,CX16,<b14>>
  AMD Features=0x2010<NX,LM>
  Hyperthreading: 2 logical CPUs
real memory  = 3757965312 (3583 MB)
avail memory = 3678597120 (3508 MB)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
ioapic0: Changing APIC ID to 8
ioapic1: Changing APIC ID to 9
ioapic2: Changing APIC ID to 10
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
ioapic2

Re: 6.0 random freezes

2005-12-12 Thread Atanas


Claus Guttesen said the following on 12/12/05 13:23:

Both machines boot with ACPI and hyperthreading enabled.


Try to disable HTT in bios.


I think that I already achieved that by simply disabling the acpi module
from device.hints, and it had no effect on the problem.


It seldom gives you very much, and
sometimes degrades performance. Is it a webserver?


It is a web server, and as such it tends to generate a lot of processes,
many of them independent of each other and trying to run simultaneously.
Thus more work horses (even less powerful virtual CPUs) make the server
perform more smoothly.

This is just a practical observation though, and I could be wrong. I
would rather go with 2 dual core Opterons, but these are sort of
expensive for now.


If it generates
a lot of temporary files you can try adding/changing the following in
/etc/sysctl.conf:

kern.ipc.somaxconn=2048
kern.maxfiles=65536
vfs.ufs.dirhash_maxmem=8388608


Currently I have the following:

  kern.ipc.somaxconn: 1024
  kern.maxfiles: 12328
  vfs.ufs.dirhash_maxmem: 2097152
  kern.openfiles: 1992

Its closest relative (running 5.4-RELEASE on the same hardware) handles
about twice as many requests, temporary files, and open files.
kern.openfiles there is about 4000, and if something tries to go above
the limits, the kernel usually reports that.

I have plenty of other boxes serving at least twice as many requests with
less powerful (also hyperthreaded) CPUs running 4.x and 5.x and with no
problems. The ones I have problems with are way less loaded, and are
supposedly the faster ones.

Thanks for your suggestions!

Regards,
Atanas


Re: 6.0 random freezes

2005-12-12 Thread Atanas


Ronald Klop said the following on 12/12/05 13:27:


What happens if you set one of these sysctl values to 0? (This disables  
SMP changes from 5.4 to 6.0.)

debug.mpsafevfs: 1
debug.mpsafenet: 1
debug.mpsafevm: 1


Thanks for the suggestion!
I just did so and rebooted both machines, so we'll see.

I remember unsetting debug.mpsafenet before 5.4 due to some ipfw 
limitations, but didn't know about the other two.


And is there a possibility (performance-wise) to build a kernel with  
WITNESS and/or INVARIANTS options compiled in. This will give more info  
about possible locking problems. Your system will run slower. And 
because  of this the problem may not occur anymore, but it is worth the 
try.


Both machines are not much loaded, so I could afford slowing them down a 
bit for a while (I hope it won't be several times slower). I will do 
that at some point later if the problem still persists.


I hope I won't be forced to downgrade to 5.4, though I'm already working 
on that (just in case).


Regards,
Atanas

Re: 6.0 random freezes

2005-12-12 Thread Atanas


Peter Jeremy said the following on 12/12/05 13:40:


Define freezing:  Does it respond to pings?  Can you switch VTYs?
Do the num-lock/caps-lock LEDs respond?  Do some processes seem to
freeze before others?


I used the word "freeze" instead of "crash", because the latter often
gets associated with some errors reported by the kernel in system logs
or on the console. In this case there are absolutely no error messages. 
I also have remote logging enabled (on another machine over the 
network), but there's nothing there either.


When the thing happens, the server appears to respond to pings for the
first few minutes, but then everything stays down until I go to the data center.

When I plug in a keyboard, there's no response at all - no LEDs, no VTYs, 
no Ctrl-Alt-Esc, etc. You might think of hint.atkbd.0.flags not being set
properly, but it's right (i.e. unchanged, it appears to default to that
on i386 5.x+) and other machines with identical configuration do accept
a keyboard.

I have no information about processes. The only thing I have is a 
real-time CPU load graph. I have a script tailing the output of a vmstat cpu 
15 and drawing a graph with user/system/idle times, and according to 
that graph there are no load spikes or unusual variations before the 
crashes. The usual user/system/idle percentages look like 10/7/83.
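
The parsing step of such a graphing script could be sketched in Python as below. This is not the actual script. It assumes that the last three columns of each vmstat data line are the user, system and idle percentages, as in FreeBSD's default vmstat output, and the function name is made up for this example.

```python
def cpu_percentages(line):
    """Return (user, system, idle) from one vmstat data line, or None.

    Assumes the last three whitespace-separated fields are us/sy/id.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        user, system, idle = (int(f) for f in fields[-3:])
    except ValueError:
        return None  # header line or other non-numeric output
    return (user, system, idle)
```

Header lines, which vmstat repeats periodically, are skipped because their last fields are not numbers.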



I suggest you add the following to your kernel config:
 options KDB # Enable kernel debugger support.
 options DDB # Support DDB.


I just set these along with the DEBUG option below, and got the new
kernel (from 6.0-RELEASE sources dated Dec 9) running on both machines,
so we'll see.


When it hangs, break into DDB (Ctrl-Alt-Esc on the console or BREAK on
a serial console).

As a start, run 'show lockedvnods' and 'ps'.  My guess is that you'll
see a lock that has a number of waiters - which is probably the
culprit.  Use 'panic' or 'call doadump' to get a crashdump and then
you can use kgdb to rummage around once you reboot - see
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebg-gdb.html


I don't have any experience in chasing kernel bugs, so I'm not sure
whether I would be able to get something useful, but I'll try that on 
the next crash. But if I have no keyboard response I won't be able to 
save it, right?


I do not know what a serial console is and would need some time to get 
familiar with it. Would I get something in addition to what I can get from 
the standard console?



 makeoptions   DEBUG=-g # Build kernel with gdb(1) debug symbols


I suggest you add this back in.  Without it, you can't debug any crash
dumps that you manage to get (and add dumpdev to your rc.conf).


My bad, I realized that it's kind of harmless, but it was weeks later
after I put the box in production. It's back there now.

The dumpdev variable seems to default to AUTO, i.e. trying to use the 
first swap device if it's bigger than the RAM (in my case yes), so I 
guess I don't need to touch it.



Whilst I realise that you can't have production machines freezing on
schedule, your assistance in providing more information about your
problem will help make 6.x more stable.


Yes, I know and I will try. Today I already had a couple of crashes
(got lucky, no nasty data corruptions this time), and I cannot afford 
this to continue.


I'm already working on the downgrade, but most likely I will have at 
least one of these 2 machines still running 6.x during the next day or two.


After the downgrade we could eventually set a test bed and start 
hammering it with requests. The problem would be how to trigger the 
crash and whether we would be able to reproduce it at all.


Thanks for the prompt reply!

Regards,
Atanas

Re: 6.0 random freezes

2005-12-12 Thread Atanas


Atanas said the following on 12/12/05 15:43:

Ronald Klop said the following on 12/12/05 13:27:


What happens if you set one of these sysctl values to 0? (This 
disables  SMP changes from 5.4 to 6.0.)

debug.mpsafevfs: 1
debug.mpsafenet: 1
debug.mpsafevm: 1


Thanks for the suggestion!
I just did so and rebooted both machines, so we'll see.


(replying to myself)
... and coincidentally or not, I got the next crash in less than 10 
minutes :-(


After the crash it ran for longer, until I rebooted it after rebuilding 
the kernel with debug hookups. Before the reboot I commented these out 
(i.e. set them back to 1), and now I'm waiting for a crashdump.


Regards,
Atanas
