Should the process of connecting the clusters automatically test out the 
connectivity both ways for us? Feature request for a future version?

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Howard, Stewart 
Jameson
Sent: Friday, 11 December 2015 2:14 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

Hi Again Everybody,

Ok, so we got resolution on this.  Recall that I had said we'd just added ~300 
remote cluster GPFS clients and started having problems with CTDB the very same 
day...

Among those clients, there were three that had misconfigured firewalls, such 
that they could reach our home cluster nodes on port 1191, but our home cluster 
nodes could *not* reach them on 1191 *or* on any of the ephemeral ports.  This 
situation played absolute *havoc* with the stability of the filesystem.  From 
what we could tell, it seemed that these three nodes would establish a 
harmless-looking connection and mount the filesystem.  However, as soon as one 
of them acquired a resource (lock token or similar?) that the home cluster 
needed back...watch out!

In the GPFS logs on our side, we would see messages asking for the expulsion of 
these nodes about 4 - 5 times per day and a ton of messages about timeouts when 
trying to contact them.  These nodes would then re-join the cluster, since they 
could contact us, and this would entail repeated "delay N seconds for recovery" 
events.

During these recovery periods, the filesystem would become unresponsive for up 
to 60 or more seconds at a time.  This seemed to cause various NFS processes to 
fall on their faces.  Sometimes, the victim would be nfsd itself;  other times, 
it would be rpc.mountd.  CTDB would then come check on NFS, find that it was 
floundering, and start a recovery run.  To make things worse, at those very 
times the CTDB shared accounting files would *also* be unavailable since they 
reside on the same GPFS filesystem that they are serving (thanks to Doug for 
pointing out the flaw in this design and we're currently looking for an 
alternate home for these shared files).

This all added up to a *lot* of flapping, in NFS as well as with CTDB itself.  
However, the problems with CTDB/NFS were a *symptom* in this case, not a root 
cause.  The *cause* was the imperfect connectivity of just three out of 300 new 
clients.  I think the moral of the story here is this:  if you're adding remote 
cluster clients, make *absolutely* sure that all communications work going both 
ways between your home cluster and *every* new client.  If there is 
asymmetrical connectivity such as we had last week, you are in for one wild 
ride.  I would also point out that the flapping did not stop until we resolved 
connectivity for *all* of the clients, so remember that even having one single 
half-connected client is poisonous to your stability.
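
A quick way to sanity-check this is a small script run from both sides.  The 
sketch below is just an illustration (the hostnames are placeholders, and it 
assumes the GPFS daemon is on the default port 1191); it only tests plain TCP 
reachability and doesn't replace a proper firewall audit of the ephemeral 
port range:

```shell
#!/usr/bin/env bash
# Test TCP reachability of each peer on the GPFS daemon port (default 1191).
# Run this once from a home-cluster node against every new remote client,
# and again from every remote client against every home-cluster node --
# a pass in one direction says nothing about the other.
PORT=1191

check_port() {
  local host=$1 port=$2
  # bash's /dev/tcp pseudo-device attempts a TCP connect; timeout caps the wait
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} BLOCKED"
  fi
}

# Placeholder hostnames -- substitute your real node list here.
for host in remote-client-01 remote-client-02; do
  check_port "$host" "$PORT"
done
```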

Thanks to everybody for all of your help!  Unless something changes, I'm 
declaring that our site is out of the woods on this one  :)

Stewart
________________________________________
From: [email protected] 
<[email protected]> on behalf of Sanchez, Paul 
<[email protected]>
Sent: Tuesday, December 8, 2015 5:00 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

One similar incident I've seen is if a filesystem is configured with too low a 
"-n numNodes" value for the number of nodes actually mounting (or remote 
mounting) the filesystem, then the cluster may become overloaded, lease 
renewals may be affected, and node expels may occur.
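
As a rough first check for this, you can compare the configured estimate 
against reality on a cluster node (the device name `gpfs0` below is a 
placeholder):

```shell
# Show the "-n" estimate the filesystem was created with
# ("Estimated number of nodes that will mount file system").
mmlsfs gpfs0 -n

# Show how many nodes actually have it mounted right now.
mmlsmount gpfs0
```

If the mount count is well above the `-n` estimate, that mismatch is worth 
ruling out.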

I'm sure we'll all be interested in a recap of what you actually discover here, 
when the problem is identified.

Thx
Paul

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Howard, Stewart 
Jameson
Sent: Tuesday, December 08, 2015 3:19 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

Hi All,

An update on this.  As events have unfolded, we have noticed a new symptom 
(cause?) that correlates very well, in time, with the instability we've been 
seeing on our protocol nodes.  Specifically, three of the recently deployed 
remote-cluster clients are getting repeatedly expelled from the cluster and 
then recovered.

The expulsion-recovery cycles seem to go in fits and starts.  They usually last 
about 20 to 30 minutes and will involve one, two, or even three of these nodes 
getting expelled and then rejoining, sometimes as many as ten or twelve times 
before things calm down.  We're not sure if these expulsions are *causing* the 
troubles that we're having, but the fact that they seem to coincide so well 
seems very suspicious.  Also, during one of these events yesterday, I myself 
saw a `cp` operation wait forever to start during a time period that later, 
from logs, appeared to be an expulsion-recovery cycle for one of these nodes.

Currently, we're investigating:

1)  Problems with networking hardware between our home cluster and these 
remote-cluster nodes.

2)  Misconfiguration of those nodes that breaks connectivity somehow.

3)  Load or resource depletion on the problem nodes that may cause them to be 
unresponsive.

On the CTDB front, we've increased CTDB's tolerance for unresponsiveness in the 
filesystem and hope that will at least keep the front end from going crazy when 
the filesystem becomes unresponsive.

Has anybody seen a cluster suffer so badly from membership-thrashing by 
remote-cluster nodes?  Is there a way to "blacklist" nodes that don't play 
nicely until they can be fixed?  Any suggestions of conditions that might cause 
repeated expulsions?

Thanks so much for your help!

Stewart
________________________________________
From: [email protected] 
<[email protected]> on behalf of Simon Thompson 
(Research Computing - IT Services) <[email protected]>
Sent: Tuesday, December 8, 2015 9:56 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

4.2.0 is out.

Simon
________________________________________
From: [email protected] 
[[email protected]] on behalf of Buterbaugh, Kevin L 
[[email protected]]
Sent: 08 December 2015 14:33
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

Hi Richard,

We went from GPFS 3.5.0.26 (where we also had zero problems with snapshot 
deletion) to GPFS 4.1.0.8 this past August and immediately hit the snapshot 
deletion bug (it's some sort of race condition).  It's not pleasant ... to 
recover we had to unmount the affected filesystem from both clusters, which 
didn't exactly make our researchers happy.

But the good news is that there is an efix available for it if you're on the 
4.1.0 series and I am 99% sure that the bug has also been fixed in the last 
several PTFs for the 4.1.1 series.

That's not the only bug we hit when going to 4.1.0.8 so my personal advice / 
opinion would be to bypass 4.1.0 and go straight to 4.1.1 or 4.2 when it comes 
out.  We are planning on going to 4.2 as soon as feasible ... it looks like 
it's much more stable plus has some new features (compression!) that we are 
very interested in.  Again, my 2 cents worth.

Kevin

On Dec 8, 2015, at 8:14 AM, Sobey, Richard A 
<[email protected]> wrote:

This may not be at all applicable to your situation, but we're creating 
thousands of snapshots per day of many independent filesets. The same script(s) 
call mmdelsnapshot, too. We haven't seen any particular issues with this.

GPFS 3.5.

I note with interest your bug report below about 4.1.0.x though - are you able 
to elaborate?

From: [email protected] 
[mailto:[email protected]] On Behalf Of Buterbaugh, 
Kevin L
Kevin L
Sent: 07 December 2015 17:53
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

Hi Stewart,

We had been running mmcrsnapshot with a ~700 node remote cluster accessing the 
filesystem for a couple of years now without issue.

However, we haven't been running it for a little while because there is a very 
serious bug in GPFS 4.1.0.x relating to snapshot *deletion*.  There is an efix 
for it and we are in the process of rolling that out, but will not try to 
resume snapshots until both clusters are fully updated.

HTH...

Kevin

On Dec 7, 2015, at 11:23 AM, Howard, Stewart Jameson 
<[email protected]> wrote:

Hi All,

Thanks to Doug and Kevin for the replies.  In answer to Kevin's question about 
our choice of clustering solution for NFS:  the choice was made hoping to 
maintain some simplicity by not using more than one HA solution at a time.  
However, it seems that this choice might have introduced more wrinkles than 
it's ironed out.

An update on our situation:  we have actually uncovered another clue since my 
last posting.  One thing that is now known to be correlated *very* closely 
with instability in the NFS layer is running `mmcrsnapshot`.  We had noticed 
that flapping happened like clockwork at midnight every night.  This happens to 
be the same time at which our crontab was running the `mmcrsnapshot`, so, as an 
experiment, we moved the snapshot to happen at 1 AM.
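
For reference, the change was just a shift in the cron schedule; something 
like the following (filesystem and snapshot names are made up for 
illustration):

```shell
# Old entry, which coincided with the midnight flapping:
# 0 0 * * * /usr/lpp/mmfs/bin/mmcrsnapshot gpfs0 nightly.$(date +\%Y\%m\%d)
# Moved to 1 AM as an experiment:
0 1 * * * /usr/lpp/mmfs/bin/mmcrsnapshot gpfs0 nightly.$(date +\%Y\%m\%d)
```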

After this change, the late-night flapping has moved to 1 AM and now happens 
reliably every night at that time.  I saw a post on this list from 2013 stating 
that `mmcrsnapshot` was known to hang up the filesystem with race conditions 
that result in deadlocks, and am wondering if that is still a problem with the 
`mmcrsnapshot` command.  Running the snapshots had not been an obvious problem 
before, but seems to have become one since we deployed ~300 additional GPFS 
clients in a remote cluster configuration about a week ago.

Can anybody comment on the safety of running `mmcrsnapshot` with a ~300 node 
remote cluster accessing the filesystem?

Also, I would comment that this is not the only condition under which we see 
instability in the NFS layer.  We continue to see intermittent instability 
through the day.  The creation of a snapshot is simply the one well-correlated 
condition that we've discovered so far.

Thanks so much to everyone for your help  :)

Stewart
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected] - (615)875-9633


