Re: [CentOS] Help needed with NFS issue
All,

Many thanks to everyone who commented on this issue. I believe that I have solved it.

It turns out that the number of nfsd's that I was running (32) was way too low. I observed that adding more nfsd's when NFS was hung always caused the hang to go away immediately. Now I am in the tuning stage where I'm adding more nfsd's until there are no more hangs. I am up to 172 of them now, and the hang frequency has decreased by about a factor of six. Evidently my workload has changed when I wasn't looking closely enough. I'll probably end up with about 256 nfsd's.

For the sake of completeness, here's how to change the number of nfsd's on the fly:

    echo 172 > /proc/fs/nfsd/threads

and, of course, edit /etc/sysconfig/nfs to change RPCNFSDCOUNT to set the value for the next boot.

Steve
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
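The persistent half of that change (setting RPCNFSDCOUNT for the next boot) could be scripted roughly as follows. This is only a sketch: `set_rpcnfsdcount` is a hypothetical helper name, and it assumes the sysconfig file holds at most one plain `RPCNFSDCOUNT=` line, possibly commented out.

```shell
#!/bin/sh
# Sketch only: set_rpcnfsdcount is a hypothetical helper, not a CentOS tool.
# It rewrites (or appends) the RPCNFSDCOUNT line in a sysconfig-style file.
set_rpcnfsdcount() {
    file=$1
    count=$2

    # Accept only positive integers.
    case $count in
        ''|*[!0-9]*) echo "count must be a positive integer" >&2; return 1 ;;
    esac
    [ "$count" -gt 0 ] || { echo "count must be a positive integer" >&2; return 1; }

    if grep -q '^#* *RPCNFSDCOUNT=' "$file" 2>/dev/null; then
        # Replace an existing assignment, including a commented-out one.
        sed "s/^#* *RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$count/" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        # No assignment yet: append one.
        echo "RPCNFSDCOUNT=$count" >> "$file"
    fi
}

# Intended use, as root on the NFS server (left commented out on purpose):
#   set_rpcnfsdcount /etc/sysconfig/nfs 172    # value used at next boot
#   echo 172 > /proc/fs/nfsd/threads           # value for the running server
#   cat /proc/fs/nfsd/threads                  # confirm the live count
```

Keeping the live change and the boot-time change next to each other makes it harder to forget one of them while tuning.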
Re: [CentOS] Help needed with NFS issue
Have you looked at the rpcd process with top or ps to see what state it is in? What about running strace?

What about your DNS server or any other (reverse) client lookup services that you might have enabled?

Nataraj
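One way to look at process state with ps, sketched under the assumption that the NFS server threads show up with the command name `nfsd` (as kernel nfsd threads normally do). `summarize_states` is a hypothetical helper name; it counts processes per state letter (S = sleeping, D = uninterruptible wait, R = running).

```shell
#!/bin/sh
# Sketch only: summarize_states is a hypothetical helper.
# It reads "STAT COMMAND" pairs on stdin and counts, per leading state
# letter, the processes whose command name equals the first argument.
summarize_states() {
    awk -v name="$1" '
        $2 == name { n[substr($1, 1, 1)]++ }
        END { for (s in n) printf "%s %d\n", s, n[s] }
    ' | sort
}

# Intended use on the NFS server during a hang (left commented out):
#   ps -eo stat=,comm= | summarize_states nfsd
```

If every nfsd thread sits in state D during a hang, that points at the server threads all being busy or blocked rather than at the network.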
Re: [CentOS] Help needed with NFS issue
On Thu, 19 Apr 2012, Giovanni Tirloni wrote:

> Did you run this command during "the hang" or is it constantly returning
> you that?

It is returning the timeout only during the hang; the rest of the time it works normally.

> If the latter, are you blocking UDP on either the server or the client?

No blocking.

> If you don't specify transport protocol, rpcinfo will use whatever is
> defined in the /etc/netconfig database and that's usually UDP.

Using UDP or TCP makes no difference. "rpcinfo -{u,t} host nfs" both give a timeout during the hang, and work normally during other times.

> - Is it happening at the exact same minute (eg. 2:15, 2:45, 3:15, 3:45).
> This might help you to identify a script/program that follows that schedule.

It is not related to any script that I can find. It is not happening at _exactly_ the same time all the time, although it is similar within a few minutes.

> - Is there any configuration different between this server and the others?
> /etc/system, root crontab, etc.

No differences that I can find.

> - When you say everything else BUT NFS is working fine, are pings answered
> properly without increased latency during "the hang" ?

Yes. I can even run an iperf server on the host during the hang, and from a client I run iperf -c and get normal performance.

> - What about other services? Can you set up a monitoring script connecting
> to some other service (eg. ftp, ls, exit or ssh) and reporting the total
> run time?

No other service appears to be impacted at all.

> - Can you set up a monitoring script running "rpcinfo" on localhost to make
> sure both local and remote communications hang?

Yes, can do.

-Steve
Re: [CentOS] Help needed with NFS issue
Jumping late on this thread, pardon my ignorance of some details...

On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson wrote:

> Interesting. It looks like some kind of RPC failure. During the hang, I
> cannot contact the nfs service via RPC:
>
> # rpcinfo -t nfs
> rpcinfo: RPC: Timed out
> program 100003 version 0 is not available

Did you run this command during "the hang" or is it constantly returning you that?

If the latter, are you blocking UDP on either the server or the client?

> # rpcinfo -p
>    program vers proto   port
>     100000    2   tcp    111  portmapper
>     100000    2   udp    111  portmapper
>     100024    1   udp   1007  status
>     100024    1   tcp   1010  status
>     100021    1   udp  35077  nlockmgr
>     100021    3   udp  35077  nlockmgr
>     100021    4   udp  35077  nlockmgr
>     100021    1   tcp  56622  nlockmgr
>     100021    3   tcp  56622  nlockmgr
>     100021    4   tcp  56622  nlockmgr
>     100011    1   udp   1009  rquotad
>     100011    2   udp   1009  rquotad
>     100011    1   tcp   1012  rquotad
>     100011    2   tcp   1012  rquotad
>     100003    2   udp   2049  nfs
>     100003    3   udp   2049  nfs
>     100003    4   udp   2049  nfs
>     100003    2   tcp   2049  nfs
>     100003    3   tcp   2049  nfs
>     100003    4   tcp   2049  nfs
>     100005    1   udp    605  mountd
>     100005    1   tcp    608  mountd
>     100005    2   udp    605  mountd
>     100005    2   tcp    608  mountd
>     100005    3   udp    605  mountd
>     100005    3   tcp    608  mountd
>
> However, I can connect to the service via telnet:
>
> # telnet nfs
> Trying ...
> Connected to ().
> Escape character is '^]'.

If you don't specify transport protocol, rpcinfo will use whatever is defined in the /etc/netconfig database and that's usually UDP.

A couple of ideas/questions:

- Is it happening at the exact same minute (e.g. 2:15, 2:45, 3:15, 3:45)? This might help you to identify a script/program that follows that schedule.

- Is there any configuration different between this server and the others? /etc/system, root crontab, etc.

- When you say everything else BUT NFS is working fine, are pings answered properly without increased latency during "the hang"?

- What about other services? Can you set up a monitoring script connecting to some other service (e.g. ftp, ls, exit or ssh) and reporting the total run time?

- Can you set up a monitoring script running "rpcinfo" on localhost to make sure both local and remote communications hang?

--
Giovanni
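The monitoring idea in the last two questions could look roughly like this. It is a sketch, not a tested monitoring tool: `timed_probe` is a hypothetical helper name, `SERVER` is a placeholder host name, and the 30-second interval is arbitrary. Each probe logs a timestamp, its exit status and how long it took, so a hang shows up as a long elapsed time or a non-zero status.

```shell
#!/bin/sh
# Sketch only: timed_probe is a hypothetical helper.
# Usage: timed_probe LABEL COMMAND [ARGS...]
# Runs COMMAND, discards its output, and prints one log line with a
# timestamp, the label, the exit status and the elapsed wall-clock seconds.
timed_probe() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    printf '%s %s status=%d elapsed=%ds\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$label" "$status" "$((end - start))"
}

# Intended use (left commented out; SERVER is a placeholder):
#   SERVER=nfs-server.example.com
#   while :; do
#       timed_probe rpc-local  rpcinfo -t localhost nfs    # run on the server
#       timed_probe rpc-remote rpcinfo -t "$SERVER" nfs    # run on a client
#       timed_probe ssh        ssh -o BatchMode=yes "$SERVER" true
#       sleep 30
#   done >> /tmp/nfs-probe.log
```

Comparing the local and remote rpcinfo lines for the same minute shows whether the hang is visible on the server itself or only across the network.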
Re: [CentOS] Help needed with NFS issue
On Wed, 18 Apr 2012, Ross Walker wrote:

> Is iptables disabled? If not, problem with rules or RPC helper?

Yes, iptables is not in use.

> What about selinux?

Disabled.

-Steve
Re: [CentOS] Help needed with NFS issue
On Apr 18, 2012, at 3:35 PM, Steve Thompson wrote:

> Interesting. It looks like some kind of RPC failure. During the hang, I
> cannot contact the nfs service via RPC:
>
> # rpcinfo -t nfs
> rpcinfo: RPC: Timed out
> program 100003 version 0 is not available
>
> even though it is supposedly available:
>
> # rpcinfo -p
>    program vers proto   port
>     100000    2   tcp    111  portmapper
>     100000    2   udp    111  portmapper
>     100024    1   udp   1007  status
>     100024    1   tcp   1010  status
>     100021    1   udp  35077  nlockmgr
>     100021    3   udp  35077  nlockmgr
>     100021    4   udp  35077  nlockmgr
>     100021    1   tcp  56622  nlockmgr
>     100021    3   tcp  56622  nlockmgr
>     100021    4   tcp  56622  nlockmgr
>     100011    1   udp   1009  rquotad
>     100011    2   udp   1009  rquotad
>     100011    1   tcp   1012  rquotad
>     100011    2   tcp   1012  rquotad
>     100003    2   udp   2049  nfs
>     100003    3   udp   2049  nfs
>     100003    4   udp   2049  nfs
>     100003    2   tcp   2049  nfs
>     100003    3   tcp   2049  nfs
>     100003    4   tcp   2049  nfs
>     100005    1   udp    605  mountd
>     100005    1   tcp    608  mountd
>     100005    2   udp    605  mountd
>     100005    2   tcp    608  mountd
>     100005    3   udp    605  mountd
>     100005    3   tcp    608  mountd
>
> However, I can connect to the service via telnet:
>
> # telnet nfs
> Trying ...
> Connected to ().
> Escape character is '^]'.
>
> so the service is running but internally borked in some way.

Is iptables disabled? If not, problem with rules or RPC helper?

What about selinux?

-Ross
Re: [CentOS] Help needed with NFS issue
Interesting. It looks like some kind of RPC failure. During the hang, I cannot contact the nfs service via RPC:

    # rpcinfo -t nfs
    rpcinfo: RPC: Timed out
    program 100003 version 0 is not available

even though it is supposedly available:

    # rpcinfo -p
       program vers proto   port
        100000    2   tcp    111  portmapper
        100000    2   udp    111  portmapper
        100024    1   udp   1007  status
        100024    1   tcp   1010  status
        100021    1   udp  35077  nlockmgr
        100021    3   udp  35077  nlockmgr
        100021    4   udp  35077  nlockmgr
        100021    1   tcp  56622  nlockmgr
        100021    3   tcp  56622  nlockmgr
        100021    4   tcp  56622  nlockmgr
        100011    1   udp   1009  rquotad
        100011    2   udp   1009  rquotad
        100011    1   tcp   1012  rquotad
        100011    2   tcp   1012  rquotad
        100003    2   udp   2049  nfs
        100003    3   udp   2049  nfs
        100003    4   udp   2049  nfs
        100003    2   tcp   2049  nfs
        100003    3   tcp   2049  nfs
        100003    4   tcp   2049  nfs
        100005    1   udp    605  mountd
        100005    1   tcp    608  mountd
        100005    2   udp    605  mountd
        100005    2   tcp    608  mountd
        100005    3   udp    605  mountd
        100005    3   tcp    608  mountd

However, I can connect to the service via telnet:

    # telnet nfs
    Trying ...
    Connected to ().
    Escape character is '^]'.

so the service is running but internally borked in some way.

Steve
--
Steve Thompson                 E-mail:      smt AT vgersoft DOT com
Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
39 Smugglers Path              VSW Support: support AT vgersoft DOT com
Ithaca, NY 14850
  "186,282 miles per second: it's not just a good idea, it's the law"
Re: [CentOS] Help needed with NFS issue
On Apr 17, 2012, at 6:57 PM, Steve Thompson wrote:

> On Tue, 17 Apr 2012, Ross Walker wrote:
>
>> Let me also add that constant spanning tree convergence can cause this
>> too. Make sure your choice of protocol and priority suit your topology
>> and equipment.
>
> Gives me an idea! The switch is under control of different people. I did
> have a new VLAN created for an unrelated purpose two days before this all
> started. Hmmm...

Maybe one of the ports of the bonded interfaces was assigned to this vlan causing LACP to break.

-Ross
Re: [CentOS] Help needed with NFS issue
On Wed, 18 Apr 2012, Fajar Priyanto wrote:

> Also shot in the dark from me.
> There may be some IP conflict in the network.

Yes, I thought of that one too. I am in control of all IP's on the network, so I am sure that nothing changed around the time that the trouble started. I checked for that anyway :-(

Steve
Re: [CentOS] Help needed with NFS issue
On Tue, 17 Apr 2012, Ross Walker wrote:

> Let me also add that constant spanning tree convergence can cause this
> too. Make sure your choice of protocol and priority suit your topology
> and equipment.

Gives me an idea! The switch is under control of different people. I did have a new VLAN created for an unrelated purpose two days before this all started. Hmmm...

Steve
Re: [CentOS] Help needed with NFS issue
Also shot in the dark from me.
There may be some IP conflict in the network.
Re: [CentOS] Help needed with NFS issue
On Tue, 17 Apr 2012, Ross Walker wrote:

> Take a look at the NIC and switch port flow control status during an outage,
> they may be paused due to switch load.
> Is there anything else on the network switches that might flood them every
> half hour for a two minute duration?

Unfortunately not. All of the NFS servers are on the same switch (an HP procurve) and only the one is having issues. The hang is always the same length, too. Nice try though!

Steve
Re: [CentOS] Help needed with NFS issue
On Apr 17, 2012, at 6:49 PM, Ross Walker wrote:

> Just a shot in the dark here.
>
> Take a look at the NIC and switch port flow control status during an outage,
> they may be paused due to switch load.
>
> Is there anything else on the network switches that might flood them every
> half hour for a two minute duration?

Let me also add that constant spanning tree convergence can cause this too. Make sure your choice of protocol and priority suit your topology and equipment.

-Ross
Re: [CentOS] Help needed with NFS issue
On Apr 17, 2012, at 5:40 PM, Steve Thompson wrote:

> I have four NFS servers running on Dell hardware (PE2900) under CentOS
> 5.7, x86_64. The number of NFS clients is about 170.
>
> A few days ago, one of the four, with no apparent changes, stopped
> responding to NFS requests for two minutes every half an hour (approx).
> Let's call this "the hang". It has been doing this for four days now.
> There are no log messages of any kind pertaining to this. The other three
> servers are fine, although they are less loaded. Between hangs,
> performance is excellent. Load is more or less constant, not peaky.
>
> NFS clients do get the usual "not responding, still trying" message during
> a hang.
>
> There are no cron or other jobs that launch every half an hour.
>
> All hardware on the affected server seems to be good. Disk volumes being
> served are RAID-5 sets with write-back cache enabled (BBU is good). RAID
> controller logs are free of errors.
>
> NFS servers used dual bonded gigabit links in balance-alb mode. Turning
> off one interface in the bond made no difference.
>
> Relevant /etc/sysctl.conf parameters:
>
> vm.dirty_ratio = 50
> vm.dirty_background_ratio = 1
> vm.dirty_expire_centisecs = 1000
> vm.dirty_writeback_centisecs = 100
> vm.min_free_kbytes = 65536
> net.core.rmem_default = 262144
> net.core.rmem_max = 262144
> net.core.wmem_default = 262144
> net.core.wmem_max = 262144
> net.core.netdev_max_backlog = 25000
> net.ipv4.tcp_reordering = 127
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_max_syn_backlog = 8192
> net.ipv4.tcp_no_metrics_save = 1
>
> The {r,w}mem_{max,default} values are twice what they were previously;
> changing these had no effect.
>
> The number of dirty pages is nowhere near the dirty_ratio when the hangs
> occur; there may be only 50MB of dirty memory.
>
> A local process on the NFS server is reading from disk at around 40-50
> MB/sec on average; this continues unaffected during the hang, as do all
> other network services on the host (eg an LDAP server). During the hang
> the server seems to be quite snappy in all respects apart from NFS. The
> network itself is fine as far as I can tell, and all NFS-related processes
> on the server are intact.
>
> NFS mounts on clients are made with UDP or TCP with no difference in
> results. A client mount cannot be completed ("timed out") and access to an
> already NFS mounted volume stalls during the hang (both automounted and
> manual mounts).
>
> NFS block size is 32768 r and w; using 16384 makes no difference.
>
> Tcpdump shows no NFS packets exchanged between client and server during a
> hang.
>
> I have not rebooted the affected server yet, but I have restarted NFS
> with no change.
>
> Help! I cannot figure out what is wrong, and I cannot find anything amiss.
> I'm running out of something but I don't know what it is (except perhaps
> brains). Hints, please!

Just a shot in the dark here.

Take a look at the NIC and switch port flow control status during an outage, they may be paused due to switch load.

Is there anything else on the network switches that might flood them every half hour for a two minute duration?

-Ross
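On the server side, the flow control check suggested above could be sketched as follows. `ethtool -a` reports the negotiated pause settings and `ethtool -S` dumps driver statistics. The interface name `eth0` is an assumption, and the pause counter names differ from driver to driver, so the pattern below is a guess that may need adjusting. `nonzero_pause_counters` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: nonzero_pause_counters is a hypothetical helper.
# It reads `ethtool -S` style "name: value" lines on stdin and prints the
# pause/flow-control related counters that are non-zero, as name=value.
# The name pattern is an assumption; counter names vary by NIC driver.
nonzero_pause_counters() {
    awk -F': *' '
        tolower($1) ~ /pause|xon|xoff|flow_control/ && $2 + 0 != 0 {
            name = $1
            sub(/^[ \t]+/, "", name)    # strip the leading indentation
            print name "=" $2
        }
    '
}

# Intended use during a hang (left commented out; eth0 is an assumption):
#   ethtool -a eth0                            # pause settings in effect
#   ethtool -S eth0 | nonzero_pause_counters   # counters that have moved
```

Running the second command before and during a hang and comparing the values would show whether pause frames are actually being sent or received at that time.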