Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Giovanni Tirloni
Jumping late on this thread, pardon my ignorance of some details...

On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson s...@vgersoft.com wrote:

 Interesting. It looks like some kind of RPC failure. During the hang, I
 cannot contact the nfs service via RPC:

 # rpcinfo -t server nfs
 rpcinfo: RPC: Timed out
 program 100003 version 0 is not available



Did you run this command during the hang or is it constantly returning
you that?

If the latter, are you blocking UDP on either the server or the client?


 # rpcinfo -p server
    program vers proto   port
     100000    2   tcp    111  portmapper
     100000    2   udp    111  portmapper
     100024    1   udp   1007  status
     100024    1   tcp   1010  status
     100021    1   udp  35077  nlockmgr
     100021    3   udp  35077  nlockmgr
     100021    4   udp  35077  nlockmgr
     100021    1   tcp  56622  nlockmgr
     100021    3   tcp  56622  nlockmgr
     100021    4   tcp  56622  nlockmgr
     100011    1   udp   1009  rquotad
     100011    2   udp   1009  rquotad
     100011    1   tcp   1012  rquotad
     100011    2   tcp   1012  rquotad
     100003    2   udp   2049  nfs
     100003    3   udp   2049  nfs
     100003    4   udp   2049  nfs
     100003    2   tcp   2049  nfs
     100003    3   tcp   2049  nfs
     100003    4   tcp   2049  nfs
     100005    1   udp    605  mountd
     100005    1   tcp    608  mountd
     100005    2   udp    605  mountd
     100005    2   tcp    608  mountd
     100005    3   udp    605  mountd
     100005    3   tcp    608  mountd

 However, I can connect to the service via telnet:

 # telnet server nfs
 Trying ipaddr...
 Connected to server (ipaddr).
 Escape character is '^]'.


If you don't specify transport protocol, rpcinfo will use whatever is
defined in the /etc/netconfig database and that's usually UDP.

A couple of ideas/questions:

- Is it happening at the exact same minute (e.g. 2:15, 2:45, 3:15, 3:45)?
This might help you to identify a script/program that follows that schedule.
- Is there any configuration different between this server and the others?
/etc/system, root crontab, etc.
- When you say everything else BUT NFS is working fine, are pings answered
properly without increased latency during the hang ?
- What about other services? Can you set up a monitoring script connecting
to some other service (eg. ftp, ls, exit or ssh) and reporting the total
run time?
- Can you set up a monitoring script running rpcinfo on localhost to make
sure both local and remote communications hang?
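
The monitoring ideas above could be sketched roughly as follows. This is only
an illustration: the function name "probe", the labels and the log format are
made up here, and "server" stands in for the real host name as it does
elsewhere in this thread. It runs a command, discards its output, and prints
a timestamp, the result and the elapsed seconds, so that local and remote
rpcinfo runs can be compared across a hang.

```shell
#!/bin/sh
# Sketch: run a probe command and log whether it succeeded and how long it took.
# "probe" and its output format are illustrative, not from any existing tool.
probe() {
    label=$1; shift
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then result=ok; else result=FAIL; fi
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $label $result $((end - start))s"
}

# Example loop (the 30 second interval is an arbitrary choice):
# while :; do
#     probe local-tcp  rpcinfo -t localhost nfs
#     probe local-udp  rpcinfo -u localhost nfs
#     probe remote-tcp rpcinfo -t server nfs
#     sleep 30
# done
```

Running the loop on the server itself and on one client at the same time
should show whether the local and the remote probes time out together.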

-- 
Giovanni
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Steve Thompson
On Thu, 19 Apr 2012, Giovanni Tirloni wrote:

 Did you run this command during the hang or is it constantly returning
 you that?

It is returning the time out only during the hang; the rest of the time 
it works normally.

 If the latter, are you blocking UDP on either the server or the client?

No blocking.

 If you don't specify transport protocol, rpcinfo will use whatever is
 defined in the /etc/netconfig database and that's usually UDP.

Using UDP or TCP makes no difference. rpcinfo -{u,t} host nfs both give 
a timeout during the hang, and work normally during other times.

 - Is it happening at the exact same minute (e.g. 2:15, 2:45, 3:15, 3:45)?
 This might help you to identify a script/program that follows that schedule.

It is not related to any script that I can find. It is not happening at 
_exactly_ the same time all the time, although it is similar within a few 
minutes.

 - Is there any configuration different between this server and the others?
 /etc/system, root crontab, etc.

No differences that I can find.

 - When you say everything else BUT NFS is working fine, are pings answered
 properly without increased latency during the hang ?

Yes. I can even run an iperf server on the host during the hang, and from
a client I run iperf -c and get normal performance.

 - What about other services? Can you set up a monitoring script connecting
 to some other service (eg. ftp, ls, exit or ssh) and reporting the total
 run time?

No other service appears to be impacted at all.

 - Can you set up a monitoring script running rpcinfo on localhost to make
 sure both local and remote communications hang?

Yes, can do.

-Steve


Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Nataraj
Have you looked at the rpcd process with top or ps to see what state it
is in?  What about running strace?  What about your dns server or any
other (reverse) client lookup services that you might have enabled?
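
A rough way to look at process state, assuming the processes of interest are
the kernel nfsd threads (they show up in ps with the command name "nfsd"; the
function name below is made up for illustration). It tallies the first letter
of the ps state column, so a pile of threads in D (uninterruptible sleep)
would stand out.

```shell
#!/bin/sh
# Sketch: count nfsd threads per process state.
# Reads "STAT COMMAND" pairs on stdin, as printed by: ps -e -o stat= -o comm=
tally_nfsd_states() {
    awk '$2 == "nfsd" { n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage:
#   ps -e -o stat= -o comm= | tally_nfsd_states
```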

Nataraj



Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Steve Thompson
All,

Many thanks to everyone who commented on this issue. I believe that I have 
solved it.

It turns out that the number of nfsd's that I was running (32) was way too 
low. I observed that adding more nfsd's when NFS was hung always caused 
the hang to go away immediately. Now I am in the tuning stage where I'm 
adding more nfsd's until there are no more hangs. I am up to 172 of them 
now, and the hang frequency has decreased by about a factor of six. 
Evidently my workload has changed when I wasn't looking closely enough. 
I'll probably end up with about 256 nfsd's.

For the sake of completeness, here's how to change the number of nfsd's on 
the fly:

echo 172 > /proc/fs/nfsd/threads

and, of course, edit /etc/sysconfig/nfs to change RPCNFSDCOUNT to set the
value for the next boot.
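
The two steps above (write the new count to /proc/fs/nfsd/threads, then set
RPCNFSDCOUNT in /etc/sysconfig/nfs) could be wrapped in one function along
these lines. This is a sketch only: the function name and the THREADS_FILE and
SYSCONFIG override variables are invented here so that it can be tried against
scratch files before being pointed at a live server.

```shell
#!/bin/sh
# Sketch: apply a new nfsd thread count now and persist it for the next boot.
# THREADS_FILE and SYSCONFIG are hypothetical overrides; the defaults are the
# paths named in the message above.
set_nfsd_threads() {
    count=$1
    threads_file=${THREADS_FILE:-/proc/fs/nfsd/threads}
    sysconfig=${SYSCONFIG:-/etc/sysconfig/nfs}

    # Accept only a non-empty string of digits.
    case $count in
        ''|*[!0-9]*) echo "usage: set_nfsd_threads COUNT" >&2; return 2 ;;
    esac

    # Takes effect immediately on a running server.
    echo "$count" > "$threads_file" || return 1

    # Drop any existing (possibly commented-out) setting, then append the new one.
    tmp=$(mktemp) || return 1
    grep -v '^[#[:space:]]*RPCNFSDCOUNT=' "$sysconfig" > "$tmp"
    echo "RPCNFSDCOUNT=$count" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$sysconfig" && rm -f "$tmp"
}

# Usage (as root):
#   set_nfsd_threads 172
```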

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Steve Thompson
Interesting. It looks like some kind of RPC failure. During the hang, I 
cannot contact the nfs service via RPC:

# rpcinfo -t server nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available

even though it is supposedly available:

# rpcinfo -p server
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   1007  status
    100024    1   tcp   1010  status
    100021    1   udp  35077  nlockmgr
    100021    3   udp  35077  nlockmgr
    100021    4   udp  35077  nlockmgr
    100021    1   tcp  56622  nlockmgr
    100021    3   tcp  56622  nlockmgr
    100021    4   tcp  56622  nlockmgr
    100011    1   udp   1009  rquotad
    100011    2   udp   1009  rquotad
    100011    1   tcp   1012  rquotad
    100011    2   tcp   1012  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp    605  mountd
    100005    1   tcp    608  mountd
    100005    2   udp    605  mountd
    100005    2   tcp    608  mountd
    100005    3   udp    605  mountd
    100005    3   tcp    608  mountd

However, I can connect to the service via telnet:

# telnet server nfs
Trying ipaddr...
Connected to server (ipaddr).
Escape character is '^]'.

so the service is running but internally borked in some way.

Steve
-- 

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
39 Smugglers Path  VSW Support: support AT vgersoft DOT com
Ithaca, NY 14850
   186,282 miles per second: it's not just a good idea, it's the law



Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Ross Walker
On Apr 18, 2012, at 3:35 PM, Steve Thompson s...@vgersoft.com wrote:

 Interesting. It looks like some kind of RPC failure. During the hang, I 
 cannot contact the nfs service via RPC:
 
 # rpcinfo -t server nfs
 rpcinfo: RPC: Timed out
 program 100003 version 0 is not available
 
 even though it is supposedly available:
 
 # rpcinfo -p server
    program vers proto   port
     100000    2   tcp    111  portmapper
     100000    2   udp    111  portmapper
     100024    1   udp   1007  status
     100024    1   tcp   1010  status
     100021    1   udp  35077  nlockmgr
     100021    3   udp  35077  nlockmgr
     100021    4   udp  35077  nlockmgr
     100021    1   tcp  56622  nlockmgr
     100021    3   tcp  56622  nlockmgr
     100021    4   tcp  56622  nlockmgr
     100011    1   udp   1009  rquotad
     100011    2   udp   1009  rquotad
     100011    1   tcp   1012  rquotad
     100011    2   tcp   1012  rquotad
     100003    2   udp   2049  nfs
     100003    3   udp   2049  nfs
     100003    4   udp   2049  nfs
     100003    2   tcp   2049  nfs
     100003    3   tcp   2049  nfs
     100003    4   tcp   2049  nfs
     100005    1   udp    605  mountd
     100005    1   tcp    608  mountd
     100005    2   udp    605  mountd
     100005    2   tcp    608  mountd
     100005    3   udp    605  mountd
     100005    3   tcp    608  mountd
 
 However, I can connect to the service via telnet:
 
 # telnet server nfs
 Trying ipaddr...
 Connected to server (ipaddr).
 Escape character is '^]'.
 
 so the service is running but internally borked in some way.

Is iptables disabled? If not, problem with rules or RPC helper?

What about selinux?

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Steve Thompson
On Wed, 18 Apr 2012, Ross Walker wrote:

 Is iptables disabled? If not, problem with rules or RPC helper?

Yes, iptables is not in use.

 What about selinux?

Disabled.

-Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 5:40 PM, Steve Thompson s...@vgersoft.com wrote:

 I have four NFS servers running on Dell hardware (PE2900) under CentOS 
 5.7, x86_64. The number of NFS clients is about 170.
 
 A few days ago, one of the four, with no apparent changes, stopped 
 responding to NFS requests for two minutes every half an hour (approx). 
 Let's call this the hang. It has been doing this for four days now. 
 There are no log messages of any kind pertaining to this. The other three 
 servers are fine, although they are less loaded. Between hangs, 
 performance is excellent. Load is more or less constant, not peaky.
 
 NFS clients do get the usual not responding, still trying message during 
 a hang.
 
 There are no cron or other jobs that launch every half an hour.
 
 All hardware on the affected server seems to be good. Disk volumes being 
 served are RAID-5 sets with write-back cache enabled (BBU is good). RAID 
 controller logs are free of errors.
 
 NFS servers used dual bonded gigabit links in balance-alb mode. Turning 
 off one interface in the bond made no difference.
 
 Relevant /etc/sysctl.conf parameters:
 
 vm.dirty_ratio = 50
 vm.dirty_background_ratio = 1
 vm.dirty_expire_centisecs = 1000
 vm.dirty_writeback_centisecs = 100
 vm.min_free_kbytes = 65536
 net.core.rmem_default = 262144
 net.core.rmem_max = 262144
 net.core.wmem_default = 262144
 net.core.wmem_max = 262144
 net.core.netdev_max_backlog = 25000
 net.ipv4.tcp_reordering = 127
 net.ipv4.tcp_rmem = 4096 87380 16777216
 net.ipv4.tcp_wmem = 4096 65536 16777216
 net.ipv4.tcp_max_syn_backlog = 8192
 net.ipv4.tcp_no_metrics_save = 1
 
 The {r,w}mem_{max,default} values are twice what they were previously; 
 changing these had no effect.
 
 The number of dirty pages is nowhere near the dirty_ratio when the hangs 
 occur; there may be only 50MB of dirty memory.
 
 A local process on the NFS server is reading from disk at around 40-50 
 MB/sec on average; this continues unaffected during the hang, as do all 
 other network services on the host (eg an LDAP server). During the hang 
 the server seems to be quite snappy in all respects apart from NFS. The 
 network itself is fine as far as I can tell, and all NFS-related processes 
 on the server are intact.
 
 NFS mounts on clients are made with UDP or TCP with no difference in 
 results. A client mount cannot be completed (timed out) and access to an 
 already NFS mounted volume stalls during the hang (both automounted and 
 manual mounts).
 
 NFS block size is 32768 r and w; using 16384 makes no difference.
 
 Tcpdump shows no NFS packets exchanged between client and server during a 
 hang.
 
 I have not rebooted the affected server yet, but I have restarted NFS
 with no change.
 
 Help! I cannot figure out what is wrong, and I cannot find anything amiss. 
 I'm running out of something but I don't know what it is (except perhaps
 brains). Hints, please!

Just a shot in the dark here.

Take a look at the NIC and switch port flow control status during an outage, 
they may be paused due to switch load.

Is there anything else on the network switches that might flood them every half 
hour for a two minute duration?

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 6:49 PM, Ross Walker rswwal...@gmail.com wrote:

 Just a shot in the dark here.
 
 Take a look at the NIC and switch port flow control status during an outage, 
 they may be paused due to switch load.
 
 Is there anything else on the network switches that might flood them every 
 half hour for a two minute duration?

Let me also add that constant spanning tree convergence can cause this too. 
Make sure your choice of protocol and priority suit your topology and equipment.

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Tue, 17 Apr 2012, Ross Walker wrote:

 Take a look at the NIC and switch port flow control status during an outage, 
 they may be paused due to switch load.
 Is there anything else on the network switches that might flood them every 
 half hour for a two minute duration?

Unfortunately not. All of the NFS servers are on the same switch (an HP 
procurve) and only the one is having issues. The hang is always the
same length, too. Nice try though!

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Fajar Priyanto
Also shot in the dark from me. 
There may be an IP conflict in the network.



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Tue, 17 Apr 2012, Ross Walker wrote:

 Let me also add that constant spanning tree convergence can cause this 
 too. Make sure your choice of protocol and priority suit your topology 
 and equipment.

Gives me an idea! The switch is under control of different people. I did 
have a new VLAN created for an unrelated purpose two days before this all 
started. Hmmm...

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Wed, 18 Apr 2012, Fajar Priyanto wrote:

 Also shot in the dark from me.
 There may be an IP conflict in the network.

Yes, I thought of that one too. I am in control of all IP's on the 
network, so I am sure that nothing changed around the time that the 
trouble started. I checked for that anyway :-(

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 6:57 PM, Steve Thompson wrote:

 On Tue, 17 Apr 2012, Ross Walker wrote:
 
 Let me also add that constant spanning tree convergence can cause this 
 too. Make sure your choice of protocol and priority suit your topology 
 and equipment.
 
 Gives me an idea! The switch is under control of different people. I did 
 have a new VLAN created for an unrelated purpose two days before this all 
 started. Hmmm...

Maybe one of the ports of the bonded interfaces was assigned to this vlan 
causing LACP to break.
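
One way to check from the server side whether both ports of the bond are still
up is to read the bonding driver's status file. A minimal sketch, assuming the
bond is named bond0 (the thread does not give the interface name) and using a
made-up function name:

```shell
#!/bin/sh
# Sketch: print each slave interface of a bond with its MII link status.
# Defaults to /proc/net/bonding/bond0; "bond0" is an assumed interface name.
bond_slave_status() {
    awk -F': ' '/^Slave Interface/ { slave = $2 }
                /^MII Status/ && slave != "" { print slave, $2 }' \
        "${1:-/proc/net/bonding/bond0}"
}

# Usage:
#   bond_slave_status
#   bond_slave_status /proc/net/bonding/bond1
```

The first "MII Status" line in that file describes the bond as a whole, which
is why the function ignores it until a "Slave Interface" line has been seen.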

-Ross
