I've run out of clues (EBRAINTOOSMALL) trying to solve an NFS puzzle and could use some help getting unstuck. Analysis is awkward because the customers in question are trying to make what use they can of the machines even as these problems are occurring around them, so reboots and other dramatic acts have to be scheduled well in advance.
Symptoms: after approx 1 hour of apparently normal behavior, operations like 'df -k' or 'ls -l' that touch NFS-mounted directories hang for minutes at a time and then fail with I/O errors, on any of the three machines. At that point, doing this on all 3 machines:

  umount -f -l -a -t nfs

...followed by this:

  mount -a -t nfs

...on all 3 gets things unstuck for another hour. (?!?!)

The 3 machines have NFS relationships thus:

  A mounts approx 6 directories from B  (A->B)
  B mounts approx 6 directories from A  (B->A)
  C mounts approx 6 directories from A  (C->A)  (same dirs as in B->A)
  C mounts approx 6 directories from B  (C->B)  (same dirs as in A->B)

All systems are running x86_64 CentOS 5.4 on HP xw8600 workstations connected via a Dell PowerConnect 2608 switch that's believed to be functioning properly. No jumbo packets; all MTUs are the standard 1500.

I've tried specifying both UDP and TCP in the fstab lines. I've disabled selinux. The output of 'iptables -L' is:

  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

These commands:

  service nfs status ; service portmap status

...indicate nominal conditions (all expected daemons reported running) when things are working, but also when things are b0rken.

There wasn't anything very informative in /var/log/messages at the default debug levels, but messages are now accumulating there at firehose rates because I enabled debug for everything, thus:

  for m in rpc nfs nfsd nlm; do rpcdebug -m $m -s all; done

After machine A exhibited the problem, I *think* I see evidence in /var/log/messages that the NFS client code decided it never got a response from the server (B) to some NFS request, so it retransmitted the request, and (I think) it then concluded that the retransmitted request also went unanswered, so the operation was errored out.
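For concreteness, here's the general shape of the fstab entries in play when I say I've tried UDP and TCP — server name, export path, and mount point below are made up, but the options show where proto and the retry knobs (timeo, retrans, hard, intr; all standard NFS mount options on this vintage of kernel) get tuned:

```
# /etc/fstab fragment on machine A (hostname and paths hypothetical)
# proto   = transport (tcp or udp)
# hard    = retry indefinitely instead of erroring out the operation
# intr    = allow signals to interrupt a hung NFS operation
# timeo   = initial RPC timeout, in tenths of a second
# retrans = retransmits before a "server not responding" event
b:/export/data   /mnt/b/data   nfs   proto=tcp,hard,intr,timeo=600,retrans=5   0 0
```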
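To check the "request went unanswered" theory against a packet capture more systematically than eyeballing Wireshark, RPC calls and replies can be matched by XID. A sketch (tshark field names are per Wireshark's ONC-RPC dissector, and -Y is the display-filter flag in newer tshark builds while older ones use -R, so verify against your version); the awk part is demonstrated on fabricated data:

```shell
# On a real capture, generate "xid<TAB>msgtyp" lines with something like:
#   tshark -r client.pcap -Y rpc -T fields -e rpc.xid -e rpc.msgtyp
# where msgtyp 0 = call and 1 = reply.

# Print every XID that appears as a call but never gets a matching reply.
unmatched_xids() {
    awk '$2 == 0 { call[$1] = 1 }
         $2 == 1 { delete call[$1] }
         END { for (x in call) print x }'
}

# Demo on fabricated data: 0x0000abcd is answered, 0xdeadbeef is not.
printf '0x0000abcd\t0\n0x0000abcd\t1\n0xdeadbeef\t0\n' | unmatched_xids
# prints: 0xdeadbeef
```

Running this over the client and server captures separately should show whether the reply really arrived at the client's NIC and was ignored, or never made it back at all.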
I gathered some Enet traffic for Wireshark analysis on both machines thus:

  dumpcap -i eth0 -w /tmp/`hostname`.pcap

...and viewed the client traffic with Wireshark, which (apparently) confirms that the client did indeed wait a while and then (apparently) retransmitted the NFS request. The weird thing is that Wireshark analysis of the corresponding traffic on the server shows the first request coming in and being turned around immediately; then we later see the retransmitted request arrive, and it, too, is promptly processed and the response goes out immediately. So, if I'm reading these tea leaves properly, it's as if the client lost the ability to recognize the reply to that request. [?!]

But, then, how could it be that all 3 machines seem to get into this state at more or less the same time? And why would unmounting and remounting all NFS filesystems then "fix" it? Aaaiiieeee!!!

[ Unfortunately, this problem is only occurring at the one customer site and can't be reproduced in-house, so unless I can find a way to first sanitize the logs I may not be permitted to publish them here... >-/ ]

_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/