Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Steve Thompson
All,

Many thanks to everyone who commented on this issue. I believe that I have 
solved it.

It turns out that the number of nfsd's that I was running (32) was way too 
low. I observed that adding more nfsd's when NFS was hung always caused 
the hang to go away immediately. Now I am in the tuning stage where I'm 
adding more nfsd's until there are no more hangs. I am up to 172 of them 
now, and the hang frequency has decreased by about a factor of six. 
Evidently my workload changed while I wasn't looking closely enough. 
I'll probably end up with about 256 nfsd's.

For the sake of completeness, here's how to change the number of nfsd's on 
the fly:

echo 172 > /proc/fs/nfsd/threads

and, of course, set RPCNFSDCOUNT in /etc/sysconfig/nfs so that the new
value is also used at the next boot.
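
To judge whether the thread count is still too low, one option is to watch
the "th" line of /proc/net/rpc/nfsd. This is only a sketch under assumptions
that are not from this thread: on 2.6.18-era kernels such as the one in
CentOS 5, the second field of that line should be the thread count and the
third the number of times all threads were busy at once; newer kernels may
report zeros there. The function name nfsd_th_summary is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: summarise the "th" line of /proc/net/rpc/nfsd.
# It reads the file contents on stdin so it can be tried on saved samples.
nfsd_th_summary() {
    awk '$1 == "th" {
        # Field 2: thread count; field 3: times every thread was busy.
        printf "threads=%s all_busy=%s\n", $2, $3
    }'
}

# Usage on the NFS server:
#   nfsd_th_summary < /proc/net/rpc/nfsd
```

If all_busy keeps climbing between two runs, that would suggest the server
is still running out of nfsd's.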

Steve
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Nataraj
Have you looked at the rpcd process with top or ps to see what state it
is in? What about running strace? What about your DNS server or any
other (reverse) client lookup services that you might have enabled?

Nataraj



Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Steve Thompson
On Thu, 19 Apr 2012, Giovanni Tirloni wrote:

> Did you run this command during "the hang" or is it constantly returning
> you that?

It is returning the time out only during the hang; the rest of the time 
it works normally.

> If the latter, are you blocking UDP on either the server or the client?

No blocking.

> If you don't specify transport protocol, rpcinfo will use whatever is
> defined in the /etc/netconfig database and that's usually UDP.

Using UDP or TCP makes no difference. "rpcinfo -{u,t} host nfs" both give 
a timeout during the hang, and work normally during other times.

> - Is it happening at the exact same minute (e.g. 2:15, 2:45, 3:15, 3:45)?
> This might help you to identify a script/program that follows that schedule.

It is not related to any script that I can find. It is not happening at 
_exactly_ the same time all the time, although it is similar within a few 
minutes.

> - Is there any configuration different between this server and the others?
> /etc/system, root crontab, etc.

No differences that I can find.

> - When you say everything else BUT NFS is working fine, are pings answered
> properly without increased latency during "the hang" ?

Yes. I can even run an iperf server on the host during the hang, and from
a client I run iperf -c and get normal performance.

> - What about other services? Can you set up a monitoring script connecting
> to some other service (eg. ftp, ls, exit or ssh) and reporting the total
> run time?

No other service appears to be impacted at all.

> - Can you set up a monitoring script running "rpcinfo" on localhost to make
> sure both local and remote communications hang?

Yes, can do.
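
A minimal sketch of such a monitoring script, for anyone following along.
The function name probe, the labels, and the host name "nfsserver" are
placeholders and not from this thread; rpcinfo -t and -u are the same calls
used above.

```shell
#!/bin/sh
# Hypothetical probe: run a command, then print a timestamp, a label,
# OK or FAIL, and the elapsed time in whole seconds.
probe() {
    label=$1; shift
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then status=OK; else status=FAIL; fi
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $label $status $((end - start))s"
}

# Example loop, commented out so that sourcing this file has no side
# effects. "nfsserver" is a placeholder host name.
# while :; do
#     probe local-tcp  rpcinfo -t localhost nfs
#     probe local-udp  rpcinfo -u localhost nfs
#     probe remote-tcp rpcinfo -t nfsserver nfs
#     sleep 10
# done
```

Running the loop on both the server and a client, and comparing the
timestamps of the FAIL lines, should show whether local and remote RPC
calls stall at the same moments.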

-Steve


Re: [CentOS] Help needed with NFS issue

2012-04-19 Thread Giovanni Tirloni
Jumping in late on this thread, pardon my ignorance of some details...

On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson  wrote:

> Interesting. It looks like some kind of RPC failure. During the hang, I
> cannot contact the nfs service via RPC:
>
> # rpcinfo -t  nfs
> rpcinfo: RPC: Timed out
> program 100003 version 0 is not available
>


Did you run this command during "the hang" or is it constantly returning
you that?

If the latter, are you blocking UDP on either the server or the client?


> # rpcinfo -p 
>    program vers proto   port
>     100000    2   tcp    111  portmapper
>     100000    2   udp    111  portmapper
>     100024    1   udp   1007  status
>     100024    1   tcp   1010  status
>     100021    1   udp  35077  nlockmgr
>     100021    3   udp  35077  nlockmgr
>     100021    4   udp  35077  nlockmgr
>     100021    1   tcp  56622  nlockmgr
>     100021    3   tcp  56622  nlockmgr
>     100021    4   tcp  56622  nlockmgr
>     100011    1   udp   1009  rquotad
>     100011    2   udp   1009  rquotad
>     100011    1   tcp   1012  rquotad
>     100011    2   tcp   1012  rquotad
>     100003    2   udp   2049  nfs
>     100003    3   udp   2049  nfs
>     100003    4   udp   2049  nfs
>     100003    2   tcp   2049  nfs
>     100003    3   tcp   2049  nfs
>     100003    4   tcp   2049  nfs
>     100005    1   udp    605  mountd
>     100005    1   tcp    608  mountd
>     100005    2   udp    605  mountd
>     100005    2   tcp    608  mountd
>     100005    3   udp    605  mountd
>     100005    3   tcp    608  mountd
>
> However, I can connect to the service via telnet:
>
> # telnet  nfs
> Trying ...
> Connected to  ().
> Escape character is '^]'.
>

If you don't specify transport protocol, rpcinfo will use whatever is
defined in the /etc/netconfig database and that's usually UDP.

A couple of ideas/questions:

- Is it happening at the exact same minute (e.g. 2:15, 2:45, 3:15, 3:45)?
This might help you to identify a script/program that follows that schedule.
- Is there any configuration different between this server and the others?
/etc/system, root crontab, etc.
- When you say everything else BUT NFS is working fine, are pings answered
properly without increased latency during "the hang" ?
- What about other services? Can you set up a monitoring script connecting
to some other service (eg. ftp, ls, exit or ssh) and reporting the total
run time?
- Can you set up a monitoring script running "rpcinfo" on localhost to make
sure both local and remote communications hang?

-- 
Giovanni


Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Steve Thompson
On Wed, 18 Apr 2012, Ross Walker wrote:

> Is iptables disabled? If not, problem with rules or RPC helper?

Yes, iptables is not in use.

> What about selinux?

Disabled.

-Steve


Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Ross Walker
On Apr 18, 2012, at 3:35 PM, Steve Thompson  wrote:

> Interesting. It looks like some kind of RPC failure. During the hang, I 
> cannot contact the nfs service via RPC:
> 
> # rpcinfo -t  nfs
> rpcinfo: RPC: Timed out
> program 100003 version 0 is not available
> 
> even though it is supposedly available:
> 
> # rpcinfo -p 
>    program vers proto   port
>     100000    2   tcp    111  portmapper
>     100000    2   udp    111  portmapper
>     100024    1   udp   1007  status
>     100024    1   tcp   1010  status
>     100021    1   udp  35077  nlockmgr
>     100021    3   udp  35077  nlockmgr
>     100021    4   udp  35077  nlockmgr
>     100021    1   tcp  56622  nlockmgr
>     100021    3   tcp  56622  nlockmgr
>     100021    4   tcp  56622  nlockmgr
>     100011    1   udp   1009  rquotad
>     100011    2   udp   1009  rquotad
>     100011    1   tcp   1012  rquotad
>     100011    2   tcp   1012  rquotad
>     100003    2   udp   2049  nfs
>     100003    3   udp   2049  nfs
>     100003    4   udp   2049  nfs
>     100003    2   tcp   2049  nfs
>     100003    3   tcp   2049  nfs
>     100003    4   tcp   2049  nfs
>     100005    1   udp    605  mountd
>     100005    1   tcp    608  mountd
>     100005    2   udp    605  mountd
>     100005    2   tcp    608  mountd
>     100005    3   udp    605  mountd
>     100005    3   tcp    608  mountd
> 
> However, I can connect to the service via telnet:
> 
> # telnet  nfs
> Trying ...
> Connected to  ().
> Escape character is '^]'.
> 
> so the service is running but internally borked in some way.

Is iptables disabled? If not, problem with rules or RPC helper?

What about selinux?

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-18 Thread Steve Thompson
Interesting. It looks like some kind of RPC failure. During the hang, I 
cannot contact the nfs service via RPC:

# rpcinfo -t  nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available

even though it is supposedly available:

# rpcinfo -p 
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   1007  status
    100024    1   tcp   1010  status
    100021    1   udp  35077  nlockmgr
    100021    3   udp  35077  nlockmgr
    100021    4   udp  35077  nlockmgr
    100021    1   tcp  56622  nlockmgr
    100021    3   tcp  56622  nlockmgr
    100021    4   tcp  56622  nlockmgr
    100011    1   udp   1009  rquotad
    100011    2   udp   1009  rquotad
    100011    1   tcp   1012  rquotad
    100011    2   tcp   1012  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp    605  mountd
    100005    1   tcp    608  mountd
    100005    2   udp    605  mountd
    100005    2   tcp    608  mountd
    100005    3   udp    605  mountd
    100005    3   tcp    608  mountd

However, I can connect to the service via telnet:

# telnet  nfs
Trying ...
Connected to  ().
Escape character is '^]'.

so the service is running but internally borked in some way.

Steve
-- 

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
39 Smugglers Path  VSW Support: support AT vgersoft DOT com
Ithaca, NY 14850
   "186,282 miles per second: it's not just a good idea, it's the law"



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 6:57 PM, Steve Thompson wrote:

> On Tue, 17 Apr 2012, Ross Walker wrote:
> 
>> Let me also add that constant spanning tree convergence can cause this 
>> too. Make sure your choice of protocol and priority suit your topology 
>> and equipment.
> 
> Gives me an idea! The switch is under control of different people. I did 
> have a new VLAN created for an unrelated purpose two days before this all 
> started. Hmmm...

Maybe one of the ports of the bonded interfaces was assigned to this vlan 
causing LACP to break.
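
One way to check the state of the bond from the server side is to read
/proc/net/bonding/bond0. A rough sketch follows; the function name
bond_slave_status and the device name bond0 are assumptions. Note that the
original post describes balance-alb mode, which as far as I know does not
negotiate LACP with the switch, so the per-slave link state is the more
relevant thing to look at.

```shell
#!/bin/sh
# Hypothetical helper: condense /proc/net/bonding/<bond> (read from stdin)
# into the bonding mode plus one line per slave interface.
bond_slave_status() {
    awk -F': ' '
        $1 == "Bonding Mode"    { print "mode: " $2 }
        $1 == "Slave Interface" { slave = $2 }
        # The bond-wide "MII Status" line appears before any slave; skip it.
        $1 == "MII Status" && slave != "" { status = $2 }
        $1 == "Link Failure Count" && slave != "" {
            print slave ": " status ", link failures " $2
        }
    '
}

# Usage on the NFS server (bond0 is an assumed device name):
#   bond_slave_status < /proc/net/bonding/bond0
```

A slave that is down, or a link failure count that grows around the time of
a hang, would point at the switch port rather than at NFS.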

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Wed, 18 Apr 2012, Fajar Priyanto wrote:

> Also a shot in the dark from me.
> There may be some IP conflict in the network.

Yes, I thought of that one too. I am in control of all IP's on the 
network, so I am sure that nothing changed around the time that the 
trouble started. I checked for that anyway :-(

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Tue, 17 Apr 2012, Ross Walker wrote:

> Let me also add that constant spanning tree convergence can cause this 
> too. Make sure your choice of protocol and priority suit your topology 
> and equipment.

Gives me an idea! The switch is under control of different people. I did 
have a new VLAN created for an unrelated purpose two days before this all 
started. Hmmm...

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Fajar Priyanto
Also a shot in the dark from me.
There may be some IP conflict in the network.



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Steve Thompson
On Tue, 17 Apr 2012, Ross Walker wrote:

> Take a look at the NIC and switch port flow control status during an outage, 
> they may be paused due to switch load.
> Is there anything else on the network switches that might flood them every 
> half hour for a two minute duration?

Unfortunately not. All of the NFS servers are on the same switch (an HP 
ProCurve) and only the one is having issues. The hang is always the
same length, too. Nice try though!

Steve


Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 6:49 PM, Ross Walker  wrote:

> Just a shot in the dark here.
> 
> Take a look at the NIC and switch port flow control status during an outage, 
> they may be paused due to switch load.
> 
> Is there anything else on the network switches that might flood them every 
> half hour for a two minute duration?

Let me also add that constant spanning tree convergence can cause this too. 
Make sure your choice of protocol and priority suit your topology and equipment.

-Ross



Re: [CentOS] Help needed with NFS issue

2012-04-17 Thread Ross Walker
On Apr 17, 2012, at 5:40 PM, Steve Thompson  wrote:

> I have four NFS servers running on Dell hardware (PE2900) under CentOS 
> 5.7, x86_64. The number of NFS clients is about 170.
> 
> A few days ago, one of the four, with no apparent changes, stopped 
> responding to NFS requests for two minutes every half an hour (approx). 
> Let's call this "the hang". It has been doing this for four days now. 
> There are no log messages of any kind pertaining to this. The other three 
> servers are fine, although they are less loaded. Between hangs, 
> performance is excellent. Load is more or less constant, not peaky.
> 
> NFS clients do get the usual "not responding, still trying" message during 
> a hang.
> 
> There are no cron or other jobs that launch every half an hour.
> 
> All hardware on the affected server seems to be good. Disk volumes being 
> served are RAID-5 sets with write-back cache enabled (BBU is good). RAID 
> controller logs are free of errors.
> 
> NFS servers use dual bonded gigabit links in balance-alb mode. Turning 
> off one interface in the bond made no difference.
> 
> Relevant /etc/sysctl.conf parameters:
> 
> vm.dirty_ratio = 50
> vm.dirty_background_ratio = 1
> vm.dirty_expire_centisecs = 1000
> vm.dirty_writeback_centisecs = 100
> vm.min_free_kbytes = 65536
> net.core.rmem_default = 262144
> net.core.rmem_max = 262144
> net.core.wmem_default = 262144
> net.core.wmem_max = 262144
> net.core.netdev_max_backlog = 25000
> net.ipv4.tcp_reordering = 127
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_max_syn_backlog = 8192
> net.ipv4.tcp_no_metrics_save = 1
> 
> The {r,w}mem_{max,default} values are twice what they were previously; 
> changing these had no effect.
> 
> The number of dirty pages is nowhere near the dirty_ratio when the hangs 
> occur; there may be only 50MB of dirty memory.
> 
> A local process on the NFS server is reading from disk at around 40-50 
> MB/sec on average; this continues unaffected during the hang, as do all 
> other network services on the host (eg an LDAP server). During the hang 
> the server seems to be quite snappy in all respects apart from NFS. The 
> network itself is fine as far as I can tell, and all NFS-related processes 
> on the server are intact.
> 
> NFS mounts on clients are made with UDP or TCP with no difference in 
> results. A client mount cannot be completed ("timed out") and access to an 
> already NFS mounted volume stalls during the hang (both automounted and 
> manual mounts).
> 
> NFS block size is 32768 r and w; using 16384 makes no difference.
> 
> Tcpdump shows no NFS packets exchanged between client and server during a 
> hang.
> 
> I have not rebooted the affected server yet, but I have restarted NFS
> with no change.
> 
> Help! I cannot figure out what is wrong, and I cannot find anything amiss. 
> I'm running out of something but I don't know what it is (except perhaps
> brains). Hints, please!

Just a shot in the dark here.

Take a look at the NIC and switch port flow control status during an outage; 
they may be paused due to switch load.

Is there anything else on the network switches that might flood them every half 
hour for a two minute duration?
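
On the server side, the NIC's flow control settings can be read with
"ethtool -a". A small sketch, assuming the interface is called eth0 and
using a made-up function name pause_summary; it only shows how pause frames
are configured, not whether the port is being paused right now. Pause frame
counters, where the driver exposes them, show up in "ethtool -S" under
driver-specific names.

```shell
#!/bin/sh
# Hypothetical helper: reduce "ethtool -a <iface>" output (read from stdin)
# to a single line with the three pause settings.
pause_summary() {
    awk -F':[ \t]*' '
        $1 == "Autonegotiate" { an = $2 }
        $1 == "RX"            { rx = $2 }
        $1 == "TX"            { tx = $2 }
        END { printf "autoneg=%s rx=%s tx=%s\n", an, rx, tx }
    '
}

# Usage (eth0 is an assumed interface name):
#   ethtool -a eth0 | pause_summary
```

Comparing this line between the affected server and one of the healthy ones
is a quick way to spot a flow control mismatch.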

-Ross
