I *think* I've seen this, and that we then had open TCP connections from the client to the NFS server according to netstat, but these connections were not visible in netstat on the NFS-server side.
Unfortunately I don't remember what the fix was...

 -jf

On Tue, 25 Apr 2017 at 16:06, Simon Thompson (IT Research Support) <[email protected]> wrote:

> Hi,
>
> From what I can see, Ganesha uses the Export_Id option in the config file
> (which is managed by CES) for this. I did find some reference on the
> Ganesha devs list that if it's not set, then it would read the FSID from
> the GPFS file-system; either way, they should surely be consistent across
> all the nodes. The posts I found were from someone with an IBM email
> address, so I guess someone in the IBM teams.
>
> I checked a couple of my protocol nodes and they use the same Export_Id
> consistently, though I guess that might not be the same as the FSID value.
>
> Perhaps someone from IBM could comment on whether FSID is likely to be
> the cause of my problems?
>
> Thanks
>
> Simon
>
> On 25/04/2017, 14:51, "[email protected] on behalf
> of Ouwehand, JJ" <[email protected] on behalf of
> [email protected]> wrote:
>
> >Hello,
> >
> >First, a short introduction. My name is Jaap Jan Ouwehand; I work at a
> >Dutch hospital, "VU Medical Center" in Amsterdam. We make daily use of
> >IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our
> >critical (office, research and clinical data) business processes. We
> >have three large GPFS filesystems for different purposes.
> >
> >We also had such a situation, with cNFS. A failover (IP takeover) was
> >technically fine, only clients experienced "stale file handles". We
> >opened a PMR at IBM and, after testing, delivering logs and tcpdumps,
> >and a few months, the solution turned out to lie in the fsid option.
> >
> >An NFS filehandle is built from a combination of the fsid and a hash
> >function on the inode. After a failover, the fsid value can be
> >different and the client gets a "stale file handle". To avoid this, the
> >fsid value can be specified statically. See:
> >
> >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm
> >
> >Maybe there is also a value in Ganesha that changes after a failover,
> >certainly since most sessions will be re-established after a failback.
> >Maybe you will see more debug information with tcpdump.
> >
> >Kind regards,
> >
> >Jaap Jan Ouwehand
> >ICT Specialist (Storage & Linux)
> >VUmc - ICT
> >E: [email protected]
> >W: www.vumc.com
> >
> >-----Original Message-----
> >From: [email protected]
> >[mailto:[email protected]] On Behalf Of Simon
> >Thompson (IT Research Support)
> >Sent: Tuesday, 25 April 2017 13:21
> >To: [email protected]
> >Subject: [gpfsug-discuss] NFS issues
> >
> >Hi,
> >
> >We have recently started deploying NFS in addition to our existing SMB
> >exports on our protocol nodes.
> >
> >We use a RR DNS name that points to 4 VIPs for SMB services, and
> >failover seems to work fine with SMB clients. We figured we could use
> >the same name and IPs and run Ganesha on the protocol servers; however,
> >we are seeing issues with NFS clients when IP failover occurs.
> >
> >In normal operation on a client, we might see several mounts from
> >different IPs, obviously due to the way the DNS RR is working, but it
> >all works fine.
> >
> >In a failover situation, the IP will move to another node and some
> >clients will carry on; others will hang IO to the mount points referred
> >to by the IP which has moved. We can *sometimes* trigger this by
> >manually suspending a CES node, but not always, and some clients
> >mounting from the moving IP will be fine, others won't.
> >
> >If we resume a node and it fails back, the clients that are hanging
> >will usually recover fine. We can reboot a client prior to failback and
> >it will be fine; stopping and starting the ganesha service on a
> >protocol node will also sometimes resolve the issues.
> >
> >So, has anyone seen this sort of issue, and any suggestions for how we
> >could either debug more or work around it?
> >We are currently running the packages
> >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (the 4.2.2-2 release ones).
> >
> >At one point we were seeing it a lot, and could track it back to an
> >underlying GPFS network issue that was causing protocol nodes to be
> >expelled occasionally. We resolved that and the issues became less
> >apparent, but maybe we just fixed one failure mode and so see it less
> >often.
> >
> >On the clients we use -o sync,hard BTW, as in the IBM docs.
> >
> >On a client showing the issues, we'll see NFS-related messages in
> >dmesg like:
> >
> >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not
> >responding, timed out
> >
> >which explains the client hang on certain mount points.
> >
> >The symptoms feel very much like those logged in this Gluster/Ganesha
> >bug:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1354439
> >
> >Thanks
> >
> >Simon
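[Editor's note: as an illustration of the static-fsid workaround Jaap Jan describes, here is a minimal sketch of what it looks like for a kernel-NFS (cNFS) export. The export path, client range, and fsid value are hypothetical, not taken from the thread; the only requirement is that the same fsid is configured for the same filesystem on every server, so the NFS filehandle stays valid when an IP moves. On the CES/Ganesha side the analogous settings are the export's Export_Id/Filesystem_Id, which CES manages in the Ganesha config, so they should not be hand-edited there.]

```
# /etc/exports -- hypothetical example
# fsid=101 is an arbitrary value; it must be identical on every node
# exporting this GPFS filesystem, so clients do not get a "stale file
# handle" after an IP takeover.
/gpfs/fs1  10.10.0.0/16(rw,sync,no_subtree_check,fsid=101)
```

After editing the file, `exportfs -ra` makes knfsd re-read it on each node.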
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
