Hi All,

At our site, we have very recently (as of ~48 hours ago) configured one of our 
supercomputers (an x86 cluster containing about 315 nodes) to be a GPFS client 
cluster and to access our core GPFS cluster using a remote mount, per the 
instructions in the GPFS Advanced Administration Guide. In addition to allowing 
remote access from this newly-configured client cluster, we also export the 
filesystem via NFSv3 to two other supercomputers in our data center. We do not 
use the GPFS CNFS solution to provide NFS mounts. Instead, we use CTDB to 
manage NFS on the four core-cluster client nodes that re-export the filesystem.
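For context, our CTDB-managed NFS setup follows the standard model from the CTDB documentation; the values below are illustrative rather than our exact configuration:

```shell
# /etc/sysconfig/ctdb (illustrative values, not our exact config)
CTDB_RECOVERY_LOCK=/gpfs/.ctdb/reclock            # recovery lock on the shared GPFS filesystem
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses  # floating IPs handed out to healthy nodes
CTDB_MANAGES_NFS=yes                              # enables the 60.nfs event script that
                                                  # monitors nfsd/rpc.mountd health
```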

The exports of NFSv3 managed by CTDB pre-date the client GPFS cluster 
deployment. Since deploying the GPFS clients onto that supercomputer, we have 
been experiencing a great deal of flapping in our CTDB layer. It's difficult to 
sort out what is causing what, but I can identify a handful of the symptoms 
that we're seeing:

1) In the CTDB logs of all the NFS server nodes, we see numerous complaints (on 
some nodes this is multiple times a day) that rpc.mountd is not running and is 
being restarted, i.e.,

"ERROR: MOUNTD is not running. Trying to restart it."

2) In syslog, rpc.mountd can be seen complaining that it is unable to bind to a 
socket and that an address is already in use, i.e.,

"rpc.mountd[16869]: Could not bind socket: (98) Address already in use"

The rpc.mountd daemon on these nodes is manually constrained to use port 597. 
The mountd daemon seems able to listen for UDP connections on this port, but 
not for TCP connections. However, investigating with `lsof` and `netstat` reveals 
no process that is holding port 597 and preventing rpc.mountd from binding to it.
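To reproduce the symptom independently of rpc.mountd, a probe along these lines (a hypothetical check, not part of any standard tooling) tries to bind the port over both transports and reports what the kernel says:

```python
import errno
import socket

def try_bind(port, proto=socket.SOCK_STREAM):
    """Try to bind the given port on all interfaces; return None on success,
    or the errno the kernel raised (98/EADDRINUSE matches the rpc.mountd error)."""
    s = socket.socket(socket.AF_INET, proto)
    try:
        s.bind(("0.0.0.0", port))
    except OSError as e:
        return e.errno
    finally:
        s.close()
    return None

# Probe mountd's pinned port over both transports. Run as root: 597 is a
# privileged port, so an unprivileged run reports EACCES rather than EADDRINUSE.
for proto, name in ((socket.SOCK_STREAM, "tcp"), (socket.SOCK_DGRAM, "udp")):
    err = try_bind(597, proto)
    print(name, "ok" if err is None else errno.errorcode.get(err, str(err)))
```

On an affected node we would expect this to report EADDRINUSE for TCP but succeed for UDP, matching what rpc.mountd itself sees.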

3) We also see nfsd failing its CTDB health check several times a day, i.e.,

"Event script timed out : 60.nfs monitor count : 0 pid : 7172"

Both the non-running state of rpc.mountd and the failure of nfsd to pass its 
CTDB health checks are causing multiple nodes in the NFS export cluster to 
become "UNHEALTHY" (CTDB's designation) multiple times a day, resulting in a 
lot of flapping as public IP addresses are passed back and forth between nodes.

I should mention here that nfsd on these nodes was running without any problems 
for the last month up until the night when we deployed the GPFS client cluster. 
After that deployment, the host of problems listed above suddenly started up. I 
should also mention that the new client GPFS cluster is running quite nicely, 
although it is generating a lot more open network sockets on the core-cluster 
side. We do not believe it is a coincidence that the NFS problems began at the 
same time as the GPFS client deployment, and we are inclined to conclude that 
something about deploying GPFS clients on the supercomputer in question is 
destabilizing the NFS instances running on the clients that belong to the core 
cluster.

Our current hypothesis is that introducing all of these new GPFS clients has 
caused contention for some resource on the core-cluster client nodes (ports?, 
open file handles?, something else?) and GPFS is winning out over NFS.
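To put numbers on that hypothesis, we have been doing rough per-daemon accounting of open file descriptors via /proc; the sketch below assumes the usual daemon names (mmfsd for GPFS, rpc.mountd, ctdbd) and needs sufficient privilege to read other processes' fd directories:

```python
import os

def fd_count(pid):
    """Number of open file descriptors for a pid, or None if unreadable
    (kernel threads like nfsd have no fd table; others need privilege)."""
    try:
        return len(os.listdir("/proc/%d/fd" % pid))
    except OSError:
        return None

def pids_by_name(name):
    """All pids whose comm matches `name` exactly (comm is truncated to 15 chars)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            pass
    return pids

# Compare descriptor usage of the GPFS daemon against the NFS/CTDB stack:
for daemon in ("mmfsd", "rpc.mountd", "ctdbd"):
    for pid in pids_by_name(daemon):
        print(daemon, pid, fd_count(pid))
```

Sampling this before and after mounting the remote filesystem on the new client cluster should show whether mmfsd's descriptor and socket footprint on the core-cluster nodes is growing at NFS's expense.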

Does anyone have experience with running NFS and GPFS together in such an 
environment, especially with CTDB as a high-availability daemon? Has anyone 
perhaps seen these kinds of problems before or have any ideas as to what may be 
causing them?

We're happy to provide any additional diagnostics that the group would like to 
see in order to investigate. As always, we very much appreciate any help that 
you are able to provide.

Thank you so much!

Stewart Howard
Indiana University
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss