Hi Stewart,

We use the GPFS CNFS solution for NFS mounts, and Sernet-Samba with CTDB for 
Samba shares, and that works well for us overall (we've been using this solution 
for over two years at this point). I guess I would ask: why did you choose 
CTDB instead of CNFS for NFS mounts?

I’ll also add that we are eagerly looking forward to doing some upgrades so 
that we can potentially use the GPFS Cluster Export Services mechanism going 
forward…

Kevin

On Dec 4, 2015, at 7:00 AM, Hughes, Doug <[email protected]> wrote:

One thing that we discovered very early on when using CTDB (or CNFS, for that 
matter) with GPFS is the importance of having the locking/sharing part of CTDB 
*not* live on the same filesystem that it is exporting. If they are the same, 
then as soon as the back-end main filesystem gets heavily loaded, CTDB will 
start timing out tickles, and you'll see all kinds of intermittent and 
inconvenient failures, often requiring manual recovery afterwards. We took some 
of the flash that we use for metadata and created a small, dedicated cluster 
filesystem on it that holds the CTDB locking database. Now, if the back-end 
main filesystem gets slow, it's just slow for all clients, instead of slow for 
GPFS clients and unavailable for NFS clients because all of the CTDB checks 
have failed.
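For concreteness, the only setting involved is where CTDB keeps its recovery 
lock. A minimal sketch of what we do, with the mount point as a placeholder 
rather than our real path:

    # /etc/sysconfig/ctdb -- illustrative excerpt; the path is an example only.
    # Point the recovery lock at a small, dedicated GPFS filesystem (carved
    # out of the metadata flash), not at the filesystem being exported.
    CTDB_RECOVERY_LOCK="/gpfs/ctdb_lock/.ctdb/reclock"

The point is simply that health checks against this lock should not depend on 
the load of the filesystem you are serving out.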


Sent from my android device.

-----Original Message-----
From: "Howard, Stewart Jameson" <[email protected]<mailto:[email protected]>>
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Cc: "Garrison, E Chris" <[email protected]<mailto:[email protected]>>
Sent: Thu, 03 Dec 2015 22:45
Subject: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS 
Re-exporting

Hi All,

At our site, we have very recently (as of ~48 hours ago) configured one of our 
supercomputers (an x86 cluster of about 315 nodes) as a GPFS client cluster 
that accesses our core GPFS cluster via a remote mount, per the instructions in 
the GPFS Advanced Administration Guide. In addition to allowing remote access 
from this newly configured client cluster, we also export the filesystem via 
NFSv3 to two other supercomputers in our data center. We do not use the GPFS 
CNFS solution to provide NFS mounts; instead, we use CTDB to manage NFS on the 
four core-cluster client nodes that re-export the filesystem.
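For reference, a minimal sketch of the relevant CTDB settings on those four 
nodes (values are placeholders, not our exact paths):

    # /etc/sysconfig/ctdb -- illustrative excerpt; paths are placeholders.
    CTDB_RECOVERY_LOCK="/gpfs/fs0/.ctdb/reclock"      # lock file on shared GPFS
    CTDB_MANAGES_NFS=yes                              # CTDB starts/monitors nfsd and rpc.mountd
    CTDB_NODES=/etc/ctdb/nodes                        # the four core-cluster export nodes
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses  # floating IPs handed between nodes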

The exports of NFSv3 managed by CTDB pre-date the client GPFS cluster 
deployment. Since deploying GPFS clients onto the one supercomputer, we have 
been experiencing a great deal of flapping in our CTDB layer. It's difficult to 
sort out what is causing what, but I can identify a handful of the symptoms 
that we're seeing:

1) In the CTDB logs of all the NFS server nodes, we see numerous complaints (on 
some nodes this is multiple times a day) that rpc.mountd is not running and is 
being restarted, i.e.,

“ERROR: MOUNTD is not running. Trying to restart it.”

2) In syslog, rpc.mountd can be seen complaining that it is unable to bind to a 
socket and that an address is already in use, i.e.,

“rpc.mountd[16869]: Could not bind socket: (98) Address already in use”

The rpc.mountd daemon on these nodes is manually constrained to use port 597. 
It seems able to listen for UDP connections on this port, but not for TCP 
connections. However, investigating with `lsof` and `netstat` reveals no 
process that is using port 597 and preventing rpc.mountd from binding to it.
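For what it's worth, here are the checks we've been running when the bind 
failure appears (standard Linux tools; port number as configured). One 
possibility we haven't ruled out is a socket lingering in TIME_WAIT, which 
would block the bind without lsof attributing it to any process:

    # Look for anything, in any state, touching the mountd port:
    ss -anp | grep ':597'
    netstat -anp | grep ':597'
    # What the portmapper believes mountd has registered:
    rpcinfo -p localhost | grep mountd
    # Lingering TIME_WAIT sockets on the port (not owned by any PID):
    ss -o state time-wait '( sport = :597 )'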

3) We also see nfsd failing its CTDB health check several times a day, i.e.,

“Event script timed out : 60.nfs monitor count : 0 pid : 7172”
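When this happens, we inspect the event scripts on the affected node. A sketch 
of the checks, assuming the stock CTDB event-script layout (the path may differ 
on other builds):

    ctdb status                              # overall node health and flags
    ctdb scriptstatus                        # per-event-script status and run times
    time /etc/ctdb/events.d/60.nfs monitor   # run the NFS monitor event by hand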

Both the non-running state of rpc.mountd and the failure of nfsd to pass its 
CTDB health checks are causing multiple nodes in the NFS export cluster to go 
“UNHEALTHY” (CTDB's designation) several times a day, resulting in a lot of 
flapping as public IP addresses are passed back and forth between nodes.

I should mention here that nfsd on these nodes was running without any problems 
for the last month up until the night when we deployed the GPFS client cluster. 
After that deployment, the host of problems listed above suddenly started up. I 
should also mention that the new client GPFS cluster is running quite nicely, 
although it is generating a lot more open network sockets on the core-cluster 
side. We believe it is no coincidence that the NFS problems began at the same 
time as the GPFS client deployment, and we are inclined to conclude that 
something about deploying GPFS clients on the supercomputer in question is 
destabilizing the NFS services running on the core-cluster client nodes.

Our current hypothesis is that introducing all of these new GPFS clients has 
caused contention for some resource on the core-cluster client nodes (ports?, 
open file handles?, something else?) and GPFS is winning out over NFS.
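A rough way to put numbers on that hypothesis, which we can run on the export 
nodes (standard tools; nothing here is specific to our setup beyond the GPFS 
daemon name, mmfsd):

    ss -s                                        # global socket totals
    ss -tan | awk '{print $1}' | sort | uniq -c  # TCP sockets broken down by state
    pid=$(pidof mmfsd)                           # the GPFS daemon
    ls /proc/$pid/fd | wc -l                     # file descriptors held by mmfsd
    cat /proc/sys/fs/file-nr                     # system-wide fd usage vs. limit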

Does anyone have experience with running NFS and GPFS together in such an 
environment, especially with CTDB as a high-availability daemon? Has anyone 
perhaps seen these kinds of problems before or have any ideas as to what may be 
causing them?

We're happy to provide any additional diagnostics that the group would like to 
see in order to investigate. As always, we very much appreciate any help that 
you are able to provide.

Thank you so much!

Stewart Howard
Indiana University

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected]<mailto:[email protected]> - 
(615)875-9633



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
