Hi,
> On 7 Feb 2019, at 18:51, Soumya Koduri <[email protected]> wrote:
> 
> On 2/7/19 6:01 PM, Maurits Lamers wrote:
>> Hi all,
>> 
>> I am trying to find out more about why an NFS mount, served by nfs-ganesha from a glusterfs volume, freezes.
>> 
>> A little bit of background: the system consists of one glusterfs volume across 5 nodes. Every node runs Ubuntu 16.04, gluster 4.1.7 and nfs-ganesha 2.7.1.
>> The gluster volume is exported using the setup described in the first half of
>> https://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
>> 
>> The node which freezes runs Nextcloud in a docker setup, where the entire application is stored on a path which is an nfs-ganesha mount of the glusterfs volume.
>> When I run a synchronisation operation with this Nextcloud instance, at some point the entire system freezes. The only solution is to completely restart the node.
>> Just before this freeze the /var/log/ganesha/ganesha-gfapi.log file contains an error, which seems to lead to timeouts after a short while.
>> The node running the Nextcloud instance is the only one that freezes; the rest of the cluster seems unaffected.
>> 
>> [2019-02-07 10:11:17.342132] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.345776] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.346079] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.396853] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.397650] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.398036] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-07 10:11:17.407839] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument]
> 
> There is a patch [1] submitted and under review which fixes the above error messages.
> 
>> [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(yøêÙ MzîSL4_@) failed
>> [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTnôEU«H.<Y4failed
>> [2019-02-07 10:11:24.833299] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(gÇLÁèFà»0bЯk) failed
>> [2019-02-07 10:25:01.642509] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: server [node1]:49152 has not responded in the last 42 seconds, disconnecting.
>> [2019-02-07 10:25:01.642805] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server [node2]:49152 has not responded in the last 42 seconds, disconnecting.
>> [2019-02-07 10:25:01.642946] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: server [node3]:49152 has not responded in the last 42 seconds, disconnecting.
>> [2019-02-07 10:25:02.643120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server 127.0.1.1:49152 has not responded in the last 42 seconds, disconnecting.
>> [2019-02-07 10:25:02.643314] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server [node4]:49152 has not responded in the last 42 seconds, disconnecting.
> 
> Strange that the synctask failed. Could you please turn off the features.cache-invalidation volume option and check if the issue still persists.

Will try to do this.

> Also, next time the system freezes, please check the following:
> 
> 1) Check if all the brick servers of that volume are up and running. Sometimes, if the brick servers are not reachable, it takes a while for outstanding requests to time out and get the application (nfs-ganesha) back to a normal state. Please wait for a while and check if the mount becomes accessible.

All brick servers are up and running. I have had this freeze a few times already, and the rest of the system continues to work, both the nfs-ganesha mounts on other nodes and the glusterfs mounts on those nodes.

> 2) Check if the nfs-ganesha server is responding to other requests:
> # showmount -e localhost   (on the node the server is running on)
> # try mount and I/Os from any other client

I don't think it is, but it seems that the NFS system itself is simply waiting for the volume to become available again, which it never does. I will try to check, though.

> 3) If the ganesha server isn't responding to any client, please try collecting 2-3 cores/stack traces of the nfs-ganesha server with a brief interval (say 5 min) in between, using the commands below. This is to check whether any threads are stuck in a deadlock.
> 
> To collect a core:
> # gcore <nfs-ganesha-pid>
> 
> To collect a stack trace:
> # gstack <nfs-ganesha-pid>
> 
> Thanks,
> Soumya

And also thanks!
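For my own reference, this is roughly what I plan to run on the affected node (the volume name gv0 is taken from the client log above; the /tmp filenames and the pgrep call are just my own choices):

    # as suggested: disable cache invalidation on the volume
    gluster volume set gv0 features.cache-invalidation off

    # during the next freeze: check the bricks and whether ganesha still answers locally
    gluster volume status gv0
    showmount -e localhost

    # if ganesha does not respond, collect cores/stack traces 2-3 times, ~5 min apart
    gcore -o /tmp/ganesha-core $(pgrep -x ganesha.nfsd)
    gstack $(pgrep -x ganesha.nfsd) > /tmp/ganesha-stack-$(date +%s).txt

If that is not the right way to grab the traces, please let me know.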
cheers

Maurits

> [1] https://review.gluster.org/#/c/glusterfs/+/22126/
> 
>> The log in /var/log/glusterfs/glusterd.log contains nothing after the default messages at startup.
>> The log in /var/log/ganesha/ganesha.log starts warning about its health after the messages above:
>> 
>> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
>> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
>> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
>> 07/02/2019 09:43:28 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
>> 07/02/2019 11:26:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170321, old: 170320; deq new: 170314, old: 170314
>> 07/02/2019 11:28:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170325, old: 170324; deq new: 170317, old: 170317
>> 07/02/2019 11:29:21 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170327, old: 170326; deq new: 170318, old: 170318
>> 
>> Any hints on how to solve this would be greatly appreciated.
>> 
>> cheers
>> Maurits
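PS: since the export setup came up above: the export block on the freezing node follows the layout from the integration guide linked in my first mail. From memory it looks roughly like this (export id, paths and options may differ slightly on the actual node):

    EXPORT {
        Export_Id = 1;
        Path = "/gv0";
        Pseudo = "/gv0";
        Access_Type = RW;
        Squash = No_root_squash;
        SecType = "sys";
        FSAL {
            Name = "GLUSTER";
            Hostname = "localhost";
            Volume = "gv0";
        }
    }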
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
