Re: NFS stalling on 8.1-STABLE

2010-08-21 Thread Jeremy Chadwick
On Tue, Aug 17, 2010 at 10:54:04AM -0700, Mark Morley wrote:
 On Sun, 15 Aug 2010 23:35:50 -0700 Jeremy Chadwick free...@jdc.parodius.com 
 wrote:
 On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
  I have five front end web servers that all mount their content from
  the same server via NFS.  If I stress the link on any one of the
  machines (eg: copy a large directory with a lot of files to/from the
  mounted file system) the client will pause.  That is, all processes
  trying to access that mount will freeze.  The log files fill with hundreds
  or thousands of "nfs server not responding" / "is alive again" messages.
  After 60 seconds it returns to normal, unless the load is still there
  in which case it continues to pause.
 
  This has only started happening since I upgraded the client machines
  to 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The
  server is 7.1-RELEASE-p11.  No other changes have taken place in terms
  of hardware or software or mount options, etc.
 
  All nics involved are gigabit em cards, and they are on a private
  network (web access to the boxes is via an external interface).
 
 Are there any indications in dmesg that the NIC is responsible, e.g.
 interface down/up, etc.?
 
 No, nothing like that.
 
 Does switching to UDP-based NFS solve the problem for you?
 
 Trying that now for the past 24 hours or so.  Four of the machines seem ok so 
 far, but the fifth one has started dropping the mount entirely.  Access to it 
 gives an Input / output error message.  Forcing a dismount and remounting 
 brings it back.
 
 What OS version (uname -a) and NIC are used on the NFS server?
 
 FreeBSD xxx 7.1-RELEASE-p11 FreeBSD 7.1-RELEASE-p11 #0: Wed May 26 03:20:59 
 PDT 2010
 r...@xxx:/usr/obj/usr/src/sys/CUSTOM  i386
 
 NICs are em
 
 Can you please provide the following output from one of the client
 machines running 8.1-STABLE with gigE em(4)?  You can X-out machine
 names, MAC addresses, and IP addresses/netblocks if need be.
 
 * uname -a
 
 FreeBSD xxx 8.1-STABLE FreeBSD 8.1-STABLE #0: Tue Jul 27 16:27:44 PDT 2010
 r...@xxx:/usr/obj/usr/src/sys/CUSTOM  amd64
 
 * ifconfig emX  (where X is the interface number which would be
   used for NFS)
 
 em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
 options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
 ether 00:0e:0c:85:d5:0d
 inet 192.168.1.30 netmask 0xff00 broadcast 192.168.1.255
 media: Ethernet 1000baseT full-duplex
 status: active
 
 * netstat -idn -I emX
 
 Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll Drop
 em0    1500 <Link#1>      00:0e:0c:85:d5:0d 39913814     2     0 39949943     0     0    0
 em0    1500 192.168.1.0/2 192.168.1.30      39944016     -     - 39949664     -     -    -
 
 * pciconf -lvc  (provide only the data for emX please)
 
 e...@pci0:1:6:0: class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
 vendor = 'Intel Corporation'
 device = 'Gigabit Ethernet Controller (Copper) rev 5 (82541PI)'
 class  = network
 subclass   = ethernet
 cap 01[dc] = powerspec 2  supports D0 D3  current D0
 cap 07[e4] = PCI-X supports 2048 burst read, 1 split transaction
 
 * vmstat -i
 
 interrupt              total    rate
 irq1: atkbd0             239       0
 irq16: em0          36746591     883
 irq18: em1          12658607     304
                                  ^^^

I'm ignoring em1 because em0 is the one which has the NFS traffic, and
em1 could in fact be a different model of Intel NIC (it's very common
for server vendors to include two different models of NIC on the same
board; sure, both em(4), but different models), so I'm staying focused
on em0.

The interrupt rate here looks quite high for a system that may not be
doing anything (I don't know for sure).  Can you provide output from
netstat -I em0 -n -b 1 and let it run for about 60 seconds?  This
should be done both when NFS is UDP-only, as well as when NFS is
TCP-only.  I'm curious what kind of network throughput you're seeing (in
an attempt to correlate it with the high interrupt rate).  If network I/O
is very low yet the interrupt rate is very high, the problem may be a
driver bug or something with PCI configuration/initialisation.
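
As a rough cross-check until per-second samples are available, the long-run averages already quoted above can be related to each other. This is only a sketch, assuming the fused vmstat figure for irq16: em0 reads as a total of 36746591 at a rate of 883, that the 2000 Hz timer total gives the uptime, and that the netstat and vmstat counters were sampled at about the same time. The name `summarize` is made up for illustration.

```python
# Long-run averages from the quoted vmstat -i and netstat -idn output.
# Assumption: both commands were run at roughly the same moment.
EM0_INTERRUPTS = 36_746_591   # vmstat -i, irq16: em0, total column
TIMER_TICKS = 83_207_296      # vmstat -i, cpu0: timer, total column
TIMER_HZ = 2000               # vmstat -i, cpu0: timer, rate column
EM0_IPKTS = 39_913_814        # netstat -idn, link-level row, Ipkts
EM0_OPKTS = 39_949_943        # netstat -idn, link-level row, Opkts


def summarize(interrupts, timer_ticks, timer_hz, ipkts, opkts):
    """Return average interrupt and packet rates since boot."""
    uptime_s = timer_ticks / timer_hz
    packets = ipkts + opkts
    return {
        "uptime_s": uptime_s,
        "interrupts_per_s": interrupts / uptime_s,
        "packets_per_s": packets / uptime_s,
        "packets_per_interrupt": packets / interrupts,
    }


summary = summarize(EM0_INTERRUPTS, TIMER_TICKS, TIMER_HZ, EM0_IPKTS, EM0_OPKTS)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name:>22}: {value:,.2f}")
```

These are averages since boot, so they say nothing about what happens during a stall; the one-second netstat samples are still needed for that.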

I'm also CC'ing Jack Vogel of Intel, who may have some insight into what's
going on here.

 irq21: ohci0               2       0
 irq22: ehci0          528002      12
 irq23: atapci1       2334936      56
 cpu0: timer         83207296    2000
 cpu1: timer         83207289    2000
 Total              218682962    5256
 
 * sysctl hw.pci
 
 hw.pci.usb_early_takeover: 1
 hw.pci.honor_msi_blacklist: 1
 hw.pci.enable_msix: 1
 hw.pci.enable_msi: 1
 hw.pci.do_power_resume: 1
 hw.pci.do_power_nodriver: 0
 hw.pci.enable_io_modes: 1
 hw.pci.default_vgapci_unit: 

Re: NFS stalling on 8.1-STABLE

2010-08-17 Thread Mark Morley


On Sun, 15 Aug 2010 17:11:01 -0400 (EDT) Rick Macklem rmack...@uoguelph.ca 
wrote:
 Hi all,

 I have five front end web servers that all mount their content from the same 
 server via NFS.  If I stress the link on any one of the machines (eg: copy a 
 large directory with a lot of files to/from the mounted file system) the 
 client will pause.  That is, all processes trying to access that mount will 
 freeze.  The log files fill with hundreds or thousands of "nfs server not 
 responding" / "is alive again" messages. After 60 seconds it returns to normal, 
 unless the load is still there in which case it continues to pause.


The 60sec delay suggests that the client is doing a TCP reconnect. I'd suggest
that you look at a packet trace in wireshark (it knows how to decode NFS
packets) and see if there are new TCP connections (SYN, SYN-ACK,...) being
made. If that is what is happening, I suspect it is NIC driver related, but it
is really hard to say.

I'll try this if/when it happens again.

If you can, try a network interface of a different type (not em); that will
check whether it is an em(4) issue.

Unfortunately I don't have any non-em cards around.

Alternatively, you could try turning off the TSO and checksum offload stuff
for the em(4) and see if that helps.

Hmm, interesting.  The four machines that seem to be working (so far) have 
these enabled by default.  The fifth one has checksums enabled, but not TSO.  
Doesn't appear to support it.

I also tried switching from TCP to UDP.  This seems to be working (so far) on 
four of the clients (which happen to be identical load balanced machines), but 
on the fifth one (which serves a different purpose) I'm getting something 
really weird.  Instead of locking up periodically as before, it's actually 
losing the mount.  For example, a 'df' doesn't include the mounted system.  If 
I try to access the mounted system (with 'ls' for example) I get an Input / 
output error message.  I can remount it, but only after I force a dismount.

Mark
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: NFS stalling on 8.1-STABLE

2010-08-17 Thread Mark Morley


On Sun, 15 Aug 2010 23:35:50 -0700 Jeremy Chadwick free...@jdc.parodius.com 
wrote:
On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
 I have five front end web servers that all mount their content from
 the same server via NFS.  If I stress the link on any one of the
 machines (eg: copy a large directory with a lot of files to/from the
 mounted file system) the client will pause.  That is, all processes
 trying to access that mount will freeze.  The log files fill with hundreds
 or thousands of "nfs server not responding" / "is alive again" messages.
 After 60 seconds it returns to normal, unless the load is still there
 in which case it continues to pause.

 This has only started happening since I upgraded the client machines
 to 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The
 server is 7.1-RELEASE-p11.  No other changes have taken place in terms
 of hardware or software or mount options, etc.

 All nics involved are gigabit em cards, and they are on a private
 network (web access to the boxes is via an external interface).

Are there any indications in dmesg that the NIC is responsible, e.g.
interface down/up, etc.?

No, nothing like that.

Does switching to UDP-based NFS solve the problem for you?

Trying that now for the past 24 hours or so.  Four of the machines seem ok so 
far, but the fifth one has started dropping the mount entirely.  Access to it 
gives an Input / output error message.  Forcing a dismount and remounting 
brings it back.

What OS version (uname -a) and NIC are used on the NFS server?

FreeBSD xxx 7.1-RELEASE-p11 FreeBSD 7.1-RELEASE-p11 #0: Wed May 26 03:20:59 PDT 
2010
r...@xxx:/usr/obj/usr/src/sys/CUSTOM  i386

NICs are em

Can you please provide the following output from one of the client
machines running 8.1-STABLE with gigE em(4)?  You can X-out machine
names, MAC addresses, and IP addresses/netblocks if need be.

* uname -a

FreeBSD xxx 8.1-STABLE FreeBSD 8.1-STABLE #0: Tue Jul 27 16:27:44 PDT 2010
r...@xxx:/usr/obj/usr/src/sys/CUSTOM  amd64

* ifconfig emX  (where X is the interface number which would be
  used for NFS)

em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
ether 00:0e:0c:85:d5:0d
inet 192.168.1.30 netmask 0xff00 broadcast 192.168.1.255
media: Ethernet 1000baseT full-duplex
status: active

* netstat -idn -I emX

Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll Drop
em0    1500 <Link#1>      00:0e:0c:85:d5:0d 39913814     2     0 39949943     0     0    0
em0    1500 192.168.1.0/2 192.168.1.30      39944016     -     - 39949664     -     -    -


* pciconf -lvc  (provide only the data for emX please)

e...@pci0:1:6:0: class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
vendor = 'Intel Corporation'
device = 'Gigabit Ethernet Controller (Copper) rev 5 (82541PI)'
class  = network
subclass   = ethernet
cap 01[dc] = powerspec 2  supports D0 D3  current D0
cap 07[e4] = PCI-X supports 2048 burst read, 1 split transaction


* vmstat -i

interrupt              total    rate
irq1: atkbd0             239       0
irq16: em0          36746591     883
irq18: em1          12658607     304
irq21: ohci0               2       0
irq22: ehci0          528002      12
irq23: atapci1       2334936      56
cpu0: timer         83207296    2000
cpu1: timer         83207289    2000
Total              218682962    5256

* sysctl hw.pci

hw.pci.usb_early_takeover: 1
hw.pci.honor_msi_blacklist: 1
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.pci.do_power_resume: 1
hw.pci.do_power_nodriver: 0
hw.pci.enable_io_modes: 1
hw.pci.default_vgapci_unit: -1
hw.pci.host_mem_start: 2147483648
hw.pci.mcfg: 1

* As root, run sysctl dev.em.X.stats=1 then do dmesg and
  provide the output for NIC statistics (will start with emX:)

em0: Excessive collisions = 0
em0: Sequence errors = 0
em0: Defer count = 52
em0: Missed Packets = 0
em0: Receive No Buffers = 0
em0: Receive Length Errors = 0
em0: Receive errors = 1
em0: Crc errors = 1
em0: Alignment errors = 0
em0: Collision/Carrier extension errors = 0
em0: RX overruns = 0
em0: watchdog timeouts = 0
em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
em0: XON Rcvd = 54
em0: XON Xmtd = 0
em0: XOFF Rcvd = 54
em0: XOFF Xmtd = 0
em0: Good Packets Rcvd = 39915088
em0: Good Packets Xmtd = 39951839
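
For comparing this dump between the five clients, the counters can be reduced to the ones that are not zero. The following is a minimal sketch that assumes the "emX: name = value" layout shown above; `parse_em_stats` and `nonzero_counters` are hypothetical helper names, and which counters actually matter is left open.

```python
import re

EM_STATS = """\
em0: Excessive collisions = 0
em0: Sequence errors = 0
em0: Defer count = 52
em0: Missed Packets = 0
em0: Receive No Buffers = 0
em0: Receive Length Errors = 0
em0: Receive errors = 1
em0: Crc errors = 1
em0: Alignment errors = 0
em0: Collision/Carrier extension errors = 0
em0: RX overruns = 0
em0: watchdog timeouts = 0
em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
em0: XON Rcvd = 54
em0: XON Xmtd = 0
em0: XOFF Rcvd = 54
em0: XOFF Xmtd = 0
em0: Good Packets Rcvd = 39915088
em0: Good Packets Xmtd = 39951839
"""

# One "emX: name = value" pair per line; lines holding several pairs
# (the MSIX IRQ line) do not match and are skipped.
STAT_LINE = re.compile(r"^em\d+: ([^=]+?) = (\d+)$")


def parse_em_stats(text):
    """Map counter name to integer value for every single-pair line."""
    counters = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_counters(counters, ignore_prefixes=("Good Packets",)):
    """Drop zero counters and the plain traffic totals."""
    return {
        name: value
        for name, value in counters.items()
        if value and not name.startswith(ignore_prefixes)
    }


nonzero = nonzero_counters(parse_em_stats(EM_STATS))

if __name__ == "__main__":
    for name, value in nonzero.items():
        print(f"{name}: {value}")
```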

Mark


Re: NFS stalling on 8.1-STABLE

2010-08-16 Thread Jeremy Chadwick
On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
 I have five front end web servers that all mount their content from
 the same server via NFS.  If I stress the link on any one of the
 machines (eg: copy a large directory with a lot of files to/from the
 mounted file system) the client will pause.  That is, all processes
 trying to access that mount will freeze.  The log files fill with hundreds
 or thousands of "nfs server not responding" / "is alive again" messages.
 After 60 seconds it returns to normal, unless the load is still there
 in which case it continues to pause.
 
 This has only started happening since I upgraded the client machines
 to 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The
 server is 7.1-RELEASE-p11.  No other changes have taken place in terms
 of hardware or software or mount options, etc.
 
 All nics involved are gigabit em cards, and they are on a private
 network (web access to the boxes is via an external interface).

Are there any indications in dmesg that the NIC is responsible, e.g.
interface down/up, etc.?

Does switching to UDP-based NFS solve the problem for you?

What OS version (uname -a) and NIC are used on the NFS server?

Can you please provide the following output from one of the client
machines running 8.1-STABLE with gigE em(4)?  You can X-out machine
names, MAC addresses, and IP addresses/netblocks if need be.

* uname -a
* ifconfig emX  (where X is the interface number which would be
  used for NFS)
* netstat -idn -I emX
* pciconf -lvc  (provide only the data for emX please)
* vmstat -i
* sysctl hw.pci
* As root, run sysctl dev.em.X.stats=1 then do dmesg and
  provide the output for NIC statistics (will start with emX:)

Thanks.

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: NFS stalling on 8.1-STABLE

2010-08-15 Thread Rick Macklem
 Hi all,
 
 I have five front end web servers that all mount their content from the same 
 server via NFS.  If I stress the link on any one of the machines (eg: copy a 
 large directory with a lot of files to/from the mounted file system) the 
 client will pause.  That is, all processes trying to access that mount will 
 freeze.  The log files fill with hundreds or thousands of "nfs server not 
 responding" / "is alive again" messages. After 60 seconds it returns to normal, 
 unless the load is still there in which case it continues to pause.
 

The 60sec delay suggests that the client is doing a TCP reconnect. I'd suggest
that you look at a packet trace in wireshark (it knows how to decode NFS
packets) and see if there are new TCP connections (SYN, SYN-ACK,...) being
made. If that is what is happening, I suspect it is NIC driver related, but it
is really hard to say.
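
If eyeballing the trace in wireshark is impractical, the same check can be scripted. The sketch below is an illustration only: it assumes a classic pcap file captured on Ethernet (for example one written with tcpdump -w), assumes NFS is on the usual port 2049, and ignores VLAN tags, IPv6 and pcapng. The function name `count_nfs_syns` is made up.

```python
import struct

NFS_PORT = 2049  # assumption: NFS over TCP on the standard port


def count_nfs_syns(pcap_bytes, port=NFS_PORT):
    """Count TCP SYN segments (SYN set, ACK clear) sent to `port`."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    (linktype,) = struct.unpack(endian + "I", pcap_bytes[20:24])
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    offset, syns = 24, 0  # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        _sec, _usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16]
        )
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # not IPv4 directly over Ethernet
        ip = frame[14:]
        header_len = (ip[0] & 0x0F) * 4
        (frag_field,) = struct.unpack("!H", ip[6:8])
        if ip[9] != 6 or frag_field & 0x1FFF or len(ip) < header_len + 14:
            continue  # not TCP, a later fragment, or truncated
        tcp = ip[header_len:]
        (dport,) = struct.unpack("!H", tcp[2:4])
        flags = tcp[13]
        if dport == port and flags & 0x02 and not flags & 0x10:
            syns += 1
    return syns
```

A call such as `count_nfs_syns(open("nfs.pcap", "rb").read())` (the file name is hypothetical) would then give the number of new connection attempts in the capture; more than one per mount during a stall would be consistent with a reconnect.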

If you can, try a network interface of a different type (not em); that will
check whether it is an em(4) issue.

Alternatively, you could try turning off the TSO and checksum offload stuff
for the em(4) and see if that helps.

 This has only started happening since I upgraded the client machines to 
 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The server is 
 7.1-RELEASE-p11.  No other changes have taken place in terms of hardware or 
 software or mount options, etc.
 

There were some client side fixes between 8.0 and 8.1, but I don't think any
of those have caused a regression w.r.t. connections. There is a problem w.r.t.
the nfsd getting in a loop, but that wouldn't recover after 60sec. (If it
happens, the server has to be rebooted. There is a fix for this at:
   http://people.freebsd.org/~rmacklem/freebsd8.1-patches/replay.patch
but I don't think it is what you are seeing.)

rick


NFS stalling on 8.1-STABLE

2010-08-12 Thread Mark Morley
Hi all,

I have five front end web servers that all mount their content from the same 
server via NFS.  If I stress the link on any one of the machines (eg: copy a 
large directory with a lot of files to/from the mounted file system) the client 
will pause.  That is, all processes trying to access that mount will freeze.  
The log files fill with hundreds or thousands of "nfs server not responding" / 
"is alive again" messages. After 60 seconds it returns to normal, unless the load is 
still there in which case it continues to pause.

This has only started happening since I upgraded the client machines to 
8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The server is 
7.1-RELEASE-p11.  No other changes have taken place in terms of hardware or 
software or mount options, etc.

All nics involved are gigabit em cards, and they are on a private network (web 
access to the boxes is via an external interface).

If I truss a command such as df, it gets to getfsstat() and pauses there.

Mount options are currently 
rw,tcp,nolockd,noatime,nosuid,bg,intr,soft,rsize=32768,wsize=32768 but I've 
tried all sorts of things and it doesn't seem to make a difference.

Here's a sample output from nfsstat -c from one of the boxes (uptime 14 days):

Client Info:
Rpc Counts:
   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
  75552107   3008653 300569929    253365   2426554   4748471   2035545   3015497
    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
    864598     50887      7462     11895   1137933  16160386         0  31593291
     Mknod    Fsstat    Fsinfo  PathConf    Commit
         0  22510271         5         0   3569465
Rpc Info:
  TimedOut   Invalid X Replies   Retries  Requests
         0         0         0         0 467516377
Cache Info:
 Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
1461457650  75552057 963440449 300536041  37404178   2359677   9467719   4748471
 BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
  14409992    253365  29508747  16119060  22292421     23233
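
To make these counters easier to compare between boxes, the cache rows can be reduced to hit ratios. This is a small sketch, not a diagnosis; it assumes the fused figure in the last row splits as BioRL hits 14409992 and misses 253365 (matching the Readlink RPC count), and `hit_ratio` is a name chosen for illustration.

```python
# (hits, misses) pairs read from the "Cache Info" rows of nfsstat -c.
# Assumption: the fused last-row figure is 14409992 hits, 253365 misses.
CACHE = {
    "Attr": (1_461_457_650, 75_552_057),
    "Lkup": (963_440_449, 300_536_041),
    "BioR": (37_404_178, 2_359_677),
    "BioW": (9_467_719, 4_748_471),
    "BioRL": (14_409_992, 253_365),
    "BioD": (29_508_747, 16_119_060),
    "DirE": (22_292_421, 23_233),
}

# "Rpc Info" row: the client counted no timeouts or retries at all.
RPC_REQUESTS = 467_516_377
RPC_RETRIES = 0
RPC_TIMED_OUT = 0


def hit_ratio(hits, misses):
    """Fraction of lookups served from the client cache."""
    total = hits + misses
    return hits / total if total else 0.0


ratios = {name: hit_ratio(*pair) for name, pair in CACHE.items()}
retry_fraction = RPC_RETRIES / RPC_REQUESTS

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:>6}: {ratio:.1%}")
    print(f"retries per request: {retry_fraction:.6f}")
```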

Any thoughts?

Mark