Re: 5.5-stable network interface rl0 stops working

2006-07-06 Thread Hank Hampel
Hi Roland,

On (060705), Roland Smith wrote:
> > couple of weeks - the network interface rl0 (which is the main
> > interface on the machine, rl1 is for backups/internal use only) stops
> Are they physically on the motherboard? Or on PCI cards? In the latter
> case try reseating the card in the slot.

Fortunately they are PCI cards, so I'll check the seating.

> Try switching rl0 and rl1, and see if the problem persists. Also,
> swapping out the ethernet cable is worth trying.

Swapping the cards is an option we haven't tried yet, although it had
crossed my mind. The strangest problems tend to be hardware related,
so I'll give this a try and report back.

Swapping out the ethernet cable was one of the first things I tried,
to no avail. I'm not sure the switch isn't part of the problem, though
(all other ports work correctly), so I'll change the switch port too.

> Another thing to check is if rl0 is sharing an interrupt with another
> device. That can cause problems.

No, there is no interrupt sharing for this device, but thanks for the
hint; I hadn't checked it yet.
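
A minimal sketch of how that check can be made from the boot messages,
assuming Python is available on the box; the function names are
hypothetical. It groups devices by the IRQ that dmesg reports for them and
keeps only the IRQs claimed by more than one device. On a running system
`vmstat -i` gives a similar per-device view.

```python
import re
from collections import defaultdict

def irq_map(dmesg_text):
    """Map each IRQ number to the devices that dmesg reports on it."""
    # Long dmesg lines get wrapped in mail; rejoin continuations that
    # start with "irq N" onto the device line before them.
    joined = re.sub(r"[ \t]*\n(?=irq \d)", " ", dmesg_text)
    irqs = defaultdict(list)
    for match in re.finditer(r"^(\w+): .*\birq (\d+)\b", joined, re.MULTILINE):
        irqs[int(match.group(2))].append(match.group(1))
    return dict(irqs)

def shared_irqs(dmesg_text):
    """Return only the IRQs that more than one device is attached to."""
    return {irq: devs for irq, devs in irq_map(dmesg_text).items()
            if len(devs) > 1}

# The two NIC entries from the dmesg output in this thread, wrapped as posted.
sample = """\
rl0: RealTek 8139 10/100BaseTX port 0x9000-0x90ff mem 0xf500-0xf5ff
irq 21 at device 1.0 on pci2
rl1: RealTek 8139 10/100BaseTX port 0x9400-0x94ff mem 0xf5001000-0xf50010ff
irq 22 at device 2.0 on pci2
"""
print(irq_map(sample))      # {21: ['rl0'], 22: ['rl1']}
print(shared_irqs(sample))  # {}
```

An empty result from `shared_irqs` matches what I see here: rl0 and rl1
each have an IRQ of their own.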

> > When rl0 stops working ipfw logs lots of denied packets so that it
> > seems that the dynamic (keep-state) rules don't work any longer. We
> Does the problem persist without ipfw? I've got an rl0 card on my
> workstation (6.1-STABLE, amd64, using PF without problems)

Unfortunately I can't check this, because we use ipfw to generate
traffic statistics for the jails. But once the interface has stopped
working, disabling the firewall has no effect, except that no log
messages are generated any longer.

> > After the stop on the interface occurs there is no other way to get
> > the interface up and running again than rebooting the whole machine.
> > Restarting /etc/rc.d/netif, the jails or ipfw doesn't help anything.
> What does ifconfig say after the interface stops working?

When the interface stops working, ifconfig seems to think everything
is still OK. Nothing in its output hints that the interface is not
working, and ifconfig down/up doesn't help either.

> Anything in the logs, except the denied packets?

No, strangely enough there is no other hint in the logs that the system
is not working. At first I thought it was some kind of ipfw problem,
because packets seem to arrive on the host but the responses get
blocked by ipfw. The next time it happens I'll check with tcpdump
whether packets really still arrive on the system.
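
A rough sketch of that tcpdump check, assuming Python 3.7 or later is
installed and the script runs as root; the helper names are hypothetical.
The filter excludes frames sent by rl0 itself (its Ethernet address is
taken from the dmesg output), so whatever gets counted came in from the
wire.

```python
import subprocess

RL0_MAC = "00:02:2a:d5:39:74"  # rl0's Ethernet address as reported by dmesg

def tcpdump_cmd(iface, count, own_mac):
    """Build a tcpdump command that captures up to `count` inbound frames."""
    # -l: line-buffered output, -n: no name lookups, -c: stop after `count`
    # packets, -i: interface. The filter drops frames sent by our own NIC.
    return ["tcpdump", "-l", "-n", "-c", str(count), "-i", iface,
            "not", "ether", "src", own_mac]

def count_packets(output):
    """Count non-empty lines; tcpdump prints roughly one line per packet."""
    return sum(1 for line in output.splitlines() if line.strip())

def inbound_packets(iface="rl0", count=20, timeout=30, own_mac=RL0_MAC):
    """Run tcpdump for at most `timeout` seconds and count what arrived."""
    cmd = tcpdump_cmd(iface, count, own_mac)
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=timeout).stdout
    except subprocess.TimeoutExpired as exc:
        # Keep whatever was captured before the timeout hit.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
    return count_packets(out)
```

If `inbound_packets()` returns 0 while the interface is hung, that would
suggest nothing reaches the card at all (card, cable or switch); a non-zero
count would suggest the packets arrive and get lost further up.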

On the other hand, if ipfw is part of the problem (especially the
dynamic rules) then flushing ipfw should help, I think - but it
doesn't. So maybe it's a hardware issue; I'll definitely check this
and report back. Thanks for the hints and tips!


Best regards, Hank




5.5-stable network interface rl0 stops working

2006-07-05 Thread Hank Hampel
Hello everybody,

I have a very disturbing problem with one of our FreeBSD 5.5-stable
machines. It is a box on which ~10 jail systems run, each with
small to moderate network traffic.

Now from time to time - sometimes after a few days, sometimes after a
couple of weeks - the network interface rl0 (which is the main
interface on the machine, rl1 is for backups/internal use only) stops
working.

Each jailed system has its own firewall ruleset, permitting only
traffic for the services in that specific jail. The packet filter used
is ipfw. Some of the rules are stateful (keep-state).

When rl0 stops working ipfw logs lots of denied packets so that it
seems that the dynamic (keep-state) rules don't work any longer. We
checked and increased the buffers for the dynamic rules to no avail -
I doubt they are part of the problem. I'm not even sure ipfw is part
of the problem.

After the stop on the interface occurs there is no other way to get
the interface up and running again than rebooting the whole machine.
Restarting /etc/rc.d/netif, the jails or ipfw doesn't help anything.

The bad thing is I haven't found any way to trigger this problem, so
I can only check and change things and wait to see whether the
situation improves. For example I've already set debug.mpsafenet=0,
but this doesn't help; on the contrary, it seems to make the problem
slightly worse.

Find attached the dmesg output of the machine. If any other
information is needed to hunt down the cause of this problem please
let me know. I checked various list archives but haven't found a clue
yet.

-[ dmesg ]-
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.5-STABLE #5: Tue May 30 13:51:55 CEST 2006
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/SHAWSHANK
WARNING: MPSAFE network stack disabled, expect reduced performance.
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2411.60-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf34  Stepping = 4
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
real memory  = 2147418112 (2047 MB)
avail memory = 2096037888 (1998 MB)
ACPI APIC Table: GBTAWRDACPI
ioapic0 Version 2.0 irqs 0-23 on motherboard
npx0: math processor on motherboard
npx0: INT 16 interface
acpi0: GBT AWRDACPI on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x1008-0x100b on acpi0
cpu0: ACPI CPU on acpi0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0x1000-0x10bf,0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
agp0: Intel 82865 host to AGP bridge mem 0xe800-0xefff at device 0.0 on pci0
pcib1: PCI-PCI bridge at device 1.0 on pci0
pci1: PCI bus on pcib1
pcib2: ACPI PCI-PCI bridge at device 30.0 on pci0
pci2: ACPI PCI bus on pcib2
pci2: display, VGA at device 0.0 (no driver attached)
rl0: RealTek 8139 10/100BaseTX port 0x9000-0x90ff mem 0xf500-0xf5ff irq 21 at device 1.0 on pci2
miibus0: MII bus on rl0
rlphy0: RealTek internal media interface on miibus0
rlphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
rl0: Ethernet address: 00:02:2a:d5:39:74
rl1: RealTek 8139 10/100BaseTX port 0x9400-0x94ff mem 0xf5001000-0xf50010ff irq 22 at device 2.0 on pci2
miibus1: MII bus on rl1
rlphy1: RealTek internal media interface on miibus1
rlphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
rl1: Ethernet address: 00:02:2a:d5:39:53
isab0: PCI-ISA bridge at device 31.0 on pci0
isa0: ISA bus on isab0
atapci0: Intel ICH5 UDMA100 controller port 0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
pci0: serial bus, SMBus at device 31.3 (no driver attached)
acpi_tz0: Thermal Zone on acpi0
sio0: 16550A-compatible COM port port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
sio1: 16550A-compatible COM port port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
pmtimer0 on isa0
orm0: ISA Option ROM at iomem 0xc-0xc7fff on isa0
sc0: System console at flags 0x100 on isa0
sc0: VGA 16 virtual consoles, flags=0x100
vga0: Generic ISA VGA at port 0x3c0-0x3df iomem 0xa-0xb on isa0
atkbdc0: Keyboard controller (i8042) at port 0x64,0x60 on isa0
atkbd0: AT Keyboard irq 1 on atkbdc0
kbd0 at atkbd0
ppc0: parallel port not found.
Timecounter TSC frequency 2411601876 Hz quality 800
Timecounters tick every 10.000 msec
ipfw2 initialized, divert disabled, rule-based forwarding disabled, default to deny, logging disabled
ad0: 114497MB STARDOM SohoRaid Mirror Rev:B2.7/Rev 2.7 [232629/16/63] at ata0-master UDMA100
acd0: DVDROM TOSHIBA DVD-ROM SD-M1912/TM01 at ata1-master PIO4
Mounting root from ufs:/dev/ad0s1a

Re: 5.5-stable network interface rl0 stops working

2006-07-05 Thread Roland Smith
On Wed, Jul 05, 2006 at 06:40:58PM +0200, Hank Hampel wrote:
> Hello everybody,
>
> I have a very disturbing problem with one of our FreeBSD 5.5-stable
> machines. It is a box on which ~10 jail systems run, each with
> small to moderate network traffic.
>
> Now from time to time - sometimes after a few days, sometimes after a
> couple of weeks - the network interface rl0 (which is the main
> interface on the machine, rl1 is for backups/internal use only) stops
> working.

Are they physically on the motherboard? Or on PCI cards? In the latter
case try reseating the card in the slot.

Try switching rl0 and rl1, and see if the problem persists. Also,
swapping out the ethernet cable is worth trying.

Another thing to check is if rl0 is sharing an interrupt with another
device. That can cause problems.

> Each jailed system has its own firewall ruleset, permitting only
> traffic for the services in that specific jail. The packet filter used
> is ipfw. Some of the rules are stateful (keep-state).
>
> When rl0 stops working ipfw logs lots of denied packets so that it
> seems that the dynamic (keep-state) rules don't work any longer. We
> checked and increased the buffers for the dynamic rules to no avail -
> I doubt they are part of the problem. I'm not even sure ipfw is part
> of the problem.

Does the problem persist without ipfw? I've got an rl0 card on my
workstation (6.1-STABLE, amd64, using PF without problems)

> After the stop on the interface occurs there is no other way to get
> the interface up and running again than rebooting the whole machine.
> Restarting /etc/rc.d/netif, the jails or ipfw doesn't help anything.

What does ifconfig say after the interface stops working?
 
> The bad thing is I haven't found any way to trigger this problem, so
> I can only check and change things and wait to see whether the
> situation improves. For example I've already set debug.mpsafenet=0,
> but this doesn't help; on the contrary, it seems to make the problem
> slightly worse.
>
> Find attached the dmesg output of the machine. If any other
> information is needed to hunt down the cause of this problem please
> let me know. I checked various list archives but haven't found a clue
> yet.

Anything in the logs, except the denied packets?

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)

