Hi Stewart,
Can't comment on the NFS or snapshot issues. However, it's common to change
the filesystem parameters "maxMissedPingTimeout" and "minMissedPingTimeout"
when adding remote clusters.
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Tuning%20Parameters
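For reference, a hedged sketch of how those parameters are usually changed with mmchconfig — the values below are purely illustrative, not recommendations; check the tuning wiki above for guidance appropriate to your cluster:

```shell
# Illustrative values only -- consult the tuning wiki before changing these.
# View the current settings:
mmlsconfig minMissedPingTimeout maxMissedPingTimeout

# Raise the ping timeouts cluster-wide (add -i to apply immediately;
# otherwise the change takes effect when GPFS is restarted):
mmchconfig minMissedPingTimeout=60,maxMissedPingTimeout=120
```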
Below is an earlier gpfsug thread about remote cluster expels:
Re: [gpfsug-discuss] data interface and management infercace.
Bob Oesterlin, oester at gmail.com
Mon Jul 13 18:42:47 BST 2015
Some thoughts on node expels, based on the last 2-3 months of "expel hell"
here. We've spent a lot of time looking at this issue, across multiple
clusters. A big thanks to IBM for helping us zero in on the right issues.
First, you need to understand whether the expels are due to an "expired
lease" message, or due to "communication issues". It sounds like you are
talking about the latter. In the case of nodes being expelled due to
communication issues, it's more likely that the problem is related to network
congestion. This can occur at many levels - the node, the network, or the
switch.
When it's a communication issue, changing parameters like "missed ping
timeout" isn't going to help you. The problem for us ended up being that GPFS
wasn't getting a response to a periodic "keep alive" poll to the node, and
after 300 seconds, it declared the node dead and expelled it. You can tell if
this is the issue by looking at the RPC waiters just before the expel. If you
see something like a "Waiting for poll on sock" RPC, it means the node is
waiting for that periodic poll to return and isn't seeing it.
The response is either lost in the network, sitting on the network queue,
or the node is too busy to send it. You may also see RPCs like "waiting
for exclusive use of connection" - this is another clear indication of
network congestion.
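As a hedged sketch, those waiters can be inspected on a live node like this (run as root on the node in question; the mmfsadm form is the older equivalent and its output format varies by release):

```shell
# Show the current RPC waiters on this node:
mmdiag --waiters

# On older GPFS releases the equivalent was:
mmfsadm dump waiters
```

Capturing this output periodically (e.g. from cron) around the time of an expel is what lets you spot the "Waiting for poll on sock" pattern after the fact.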
Look at the GPFSUG presentations (http://www.gpfsug.org/presentations/) for
the one by Jason Hick (NERSC) - he also talks about these issues. You need to
take a look at net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, especially if you
have client nodes that are on slower network interfaces.
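A hedged illustration of what tuning those sysctls looks like — the numbers here are illustrative placeholders (min/default/max buffer sizes in bytes), not recommendations; size them for your own links and memory:

```shell
# In /etc/sysctl.conf (or a file under /etc/sysctl.d/):
# net.ipv4.tcp_rmem = min default max   (receive buffer, bytes)
# net.ipv4.tcp_wmem = min default max   (send buffer, bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Apply without a reboot:
#   sysctl -p
```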
In our case, it was a number of factors - adjusting these settings,
looking at congestion at the switch level, and some physical hardware
issues.
Bob Oesterlin, Sr Storage Engineer, Nuance Communications
robert.oesterlin at nuance.com
chris hunter
[email protected]
-----Original Message-----
Sent: Friday, 11 December 2015 2:14 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS
Re-exporting
Hi Again Everybody,
Ok, so we got resolution on this. Recall that I had said we'd just added ~300
remote cluster GPFS clients and started having problems with CTDB the very same
day...
Among those clients, there were three that had misconfigured firewalls, such
that they could reach our home cluster nodes on port 1191, but our home cluster
nodes could *not* reach them on 1191 *or* on any of the ephemeral ports. This
situation played absolute *havoc* with the stability of the filesystem. From
what we could tell, it seemed that these three nodes would establish a
harmless-looking connection and mount the filesystem. However, as soon as one
of them acquired a resource (lock token or similar?) that the home cluster
needed back...watch out!
In the GPFS logs on our side, we would see messages asking for the expulsion of these
nodes about 4 - 5 times per day and a ton of messages about timeouts when trying to
contact them. These nodes would then re-join the cluster, since they could contact us,
and this would entail repeated "delay N seconds for recovery" events.
During these recovery periods, the filesystem would become unresponsive for up
to 60 or more seconds at a time. This seemed to cause various NFS processes to
fall on their faces. Sometimes, the victim would be nfsd itself; other times,
it would be rpc.mountd. CTDB would then come check on NFS, find that it was
floundering, and start a recovery run. To make things worse, at those very
times the CTDB shared accounting files would *also* be unavailable, since they
reside on the same GPFS filesystem that they are serving (thanks to Doug for
pointing out the flaw in this design; we're currently looking for an
alternate home for these shared files).
This all added up to a *lot* of flapping, in NFS as well as in CTDB itself.
However, the problems with CTDB/NFS were a *symptom* in this case, not a root
cause. The *cause* was the imperfect connectivity of just three out of 300 new
clients. I think the moral of the story here is this: if you're adding remote
cluster clients, make *absolutely* sure that all communications work going both
ways between your home cluster and *every* new client. If there is
asymmetrical connectivity such as we had last week, you are in for one wild
ride. I would also point out that the flapping did not stop until we resolved
connectivity for *all* of the clients, so remember that even a single
half-connected client is poisonous to your stability.
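As a hedged illustration of that moral, here is a minimal connectivity probe one could run from a home cluster node (and, mirrored, from each new client back toward home). The probe() helper and the loopback example are hypothetical; the sketch assumes bash and coreutils timeout are available, and it only tests one direction at a time:

```shell
# Hypothetical sketch: probe TCP reachability of the GPFS daemon port (1191).
# This tests ONE direction only; run the same probe from each client back
# toward the home cluster nodes to verify the reverse path.
PORT=1191

probe() {
  # Attempt a TCP connect to host $1, port $2, using bash's /dev/tcp.
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "OK $1:$2"
  else
    echo "FAIL $1:$2"
  fi
}

# Example: no GPFS daemon listens on the loopback here, so this reports FAIL.
probe 127.0.0.1 "$PORT"
```

Feeding a node list through this in both directions before mmauth/mmremotecluster work would have caught our three half-connected clients up front; remember that the ephemeral ports matter too, not just 1191.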
Thanks to everybody for all of your help! Unless something changes, I'm
declaring that our site is out of the woods on this one.
Stewart
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss