Hello,

First, a short introduction. My name is Jaap Jan Ouwehand, and I work at a
Dutch hospital, the "VU Medical Center" in Amsterdam. We make daily use of IBM
Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical business
processes (office, research and clinical data). We have three large GPFS
filesystems for different purposes.

We also had such a situation with cNFS. A failover (IP takeover) worked
technically, but clients experienced "stale file handle" errors. We opened a
PMR at IBM and, after testing, delivering logs and tcpdumps, and a few months
of waiting, the solution turned out to be the fsid option.

An NFS filehandle is built from a combination of the fsid and a hash of the
inode. After a failover the fsid value can differ, leaving the client with a
"stale file handle". To avoid this, the fsid value can be specified
statically. See:

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm
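
For example, on a cNFS (kernel NFS) node the fsid can be pinned per export in
/etc/exports. This is only a sketch; the export path and fsid value below are
illustrative, and the same fsid must be configured on every node that can take
over the IP:

  # /etc/exports - pin the fsid so every cNFS node presents the same
  # filehandle for this export (path and fsid value are examples)
  /gpfs/fs1/export  *(rw,sync,no_root_squash,fsid=101)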

Maybe there is also a value in Ganesha that changes after a failover,
especially since most sessions will be re-established after a failback. A
tcpdump trace may show more debugging information.
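
As a starting point, a capture on the client around the moment of failover may
show whether the server starts returning stale-filehandle errors once the IP
has moved. The interface name and VIP below are placeholders:

  # capture NFS traffic to/from the CES VIP during a failover
  tcpdump -i eth0 -s 0 -w nfs-failover.pcap host 10.0.0.10 and port 2049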


Kind regards,
 
Jaap Jan Ouwehand
ICT Specialist (Storage & Linux)
VUmc - ICT
E: [email protected]
W: www.vumc.com



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On behalf of Simon Thompson (IT 
Research Support)
Sent: Tuesday, 25 April 2017 13:21
To: [email protected]
Subject: [gpfsug-discuss] NFS issues

Hi,

We have recently started deploying NFS in addition to our existing SMB exports
on our protocol nodes.

We use an RR DNS name that points to 4 VIPs for SMB services, and failover
seems to work fine with SMB clients. We figured we could use the same name and
IPs and run Ganesha on the protocol servers; however, we are seeing issues
with NFS clients when IP failover occurs.

In normal operation on a client, we might see several mounts from different
IPs, obviously due to the way the DNS RR works, but it all works fine.

In a failover situation, the IP will move to another node; some clients will
carry on, while others will hang IO to the mount points referred to by the IP
that has moved. We can *sometimes* trigger this by manually suspending a CES
node, but not always, and even then some clients mounting from the moving IP
will be fine while others won't.

If we resume a node and it fails back, the clients that are hanging will
usually recover fine. We can reboot a client prior to failback and it will be
fine; stopping and starting the Ganesha service on a protocol node will also
sometimes resolve the issue.
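
(For reference, a failover/failback of this sort can be triggered with the CES
suspend/resume commands, roughly as below; the node name is just an example.)

  # move the CES IPs off a protocol node, then bring them back
  mmces node suspend -N protocol-node-1
  mmces node resume -N protocol-node-1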

So, has anyone seen this sort of issue, and does anyone have suggestions for
how we could either debug further or work around it?

We are currently running the nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 packages
(the 4.2.2-2 release ones).

At one point we were seeing it a lot, and could track it back to an underlying
GPFS network issue that was causing protocol nodes to be expelled
occasionally. We resolved that and the issue became less apparent, but maybe
we just fixed one failure mode and so see it less often.

On the clients, we use -o sync,hard BTW as in the IBM docs.
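
i.e. something along these lines (the export path and mount point below are
just examples):

  mount -t nfs -o sync,hard MYNFSSERVER.bham.ac.uk:/gpfs/export /mnt/export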

On a client showing the issues, we'll see NFS-related messages in dmesg like:
[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, 
timed out

Which explains the client hang on certain mount points.

The symptoms feel very much like those logged in this Gluster/ganesha bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1354439


Thanks

Simon

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss