I have a replica 2 + arbiter setup that is used for VMs.

ip #.1 is the arb

ip #.2 and #.3 are the kvm hosts.

Two volumes are involved, and the stack is Gluster 6.5 / Ubuntu 18.04 / FUSE mounts. The Gluster networking uses a two-NIC teamd round-robin setup, which *should* have stayed up if one of the ports had failed.
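To rule the team device in or out, its runner and per-port link state can be inspected directly. A quick sketch, assuming the team device is named team0 (the device name is an assumption; adjust to your setup):

```shell
# Active runner, link-watch results, and whether any port is marked down
# (teamdctl ships with the teamd package)
teamdctl team0 state

# Kernel-side view: carrier state plus RX/TX error and drop counters
ip -s link show team0

# Any link up/down events in the kernel log around the incident window
journalctl -k --since "2019-12-05 21:50" --until "2019-12-05 22:10" | grep -i -e team -e "link"
```

If both ports show `link: up` and the counters are clean for the window in question, the problem most likely sits above the bond.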

I just had a number of VMs go read-only due to the communication failure below at 22:00, but only on kvm host #2.

VMs on the same Gluster volumes on kvm host #3 were unaffected.

The logs on host #2 show the following:

[2019-12-05 22:00:43.739804] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757095] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: server 10.255.1.3:49152 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected from GL1image-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected from GL1image-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum is not met
[2019-12-05 22:00:43.790639] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456 (xid=0x825bffb)
[2019-12-05 22:00:43.790655] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790686] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428 (xid=0x89fee01)
[2019-12-05 22:00:43.790703] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-1: remote operation failed
[2019-12-05 22:00:43.790774] E [MSGID: 114031] [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 0-GL1image-client-1: remote operation failed [Transport endpoint is not connected]
[2019-12-05 22:00:43.790777] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736542 (xid=0x825bffc)
[2019-12-05 22:00:43.790794] W [MSGID: 114029] [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-1: failed to send the fop
[2019-12-05 22:00:43.790806] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790825] E [MSGID: 114031] [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 0-GL1image-client-2: remote operation failed [Transport endpoint is not connected]
[2019-12-05 22:00:43.790842] W [MSGID: 114029] [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-2: failed to send the fop

The fop/transport-not-connected errors just repeat for another 50 lines or so until 22:00:46, at which point the volumes appear to be fine (though the VMs were still read-only until I rebooted them):

[2019-12-05 22:00:46.987242] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701328: READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 (Transport endpoint is not connected)
[2019-12-05 22:00:47.029947] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701329: READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 (Transport endpoint is not connected)
[2019-12-05 22:00:49.901075] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701330: READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 (Transport endpoint is not connected)
[2019-12-05 22:00:49.923525] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701331: READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 (Transport endpoint is not connected)
[2019-12-05 22:00:49.970219] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701332: READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 (Transport endpoint is not connected)
[2019-12-05 22:00:50.023932] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701333: READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 (Transport endpoint is not connected)
[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:46.115076] E [MSGID: 133014] [shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat failed: 7a5959d6-75fc-411d-8831-57a744776ed3 [Transport endpoint is not connected]
[2019-12-05 22:00:54.820394] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1: Connected to GL1image-client-1, attached to remote volume '/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2: Connected to GL1image-client-2, attached to remote volume '/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002] [afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP

What is odd is that the Gluster logs on nodes #3 and #1 show absolutely ZERO Gluster errors around that time, nor do I see any network/teamd errors on any of the 3 nodes (including the problem node #2).

I've checked dmesg/syslog and every other log file on the box.

According to a staff member, this same kvm host had the same problem about 3 weeks ago. It was written up as a fluke, possibly due to excess disk I/O, since we have been using Gluster for years and have rarely seen issues, especially with very basic Gluster usage.

In this case those VMs weren't overly busy, and now we have a repeat problem.

So I am wondering where else I can look to diagnose the problem, or whether I should abandon the hardware/setup.

I assume it's a networking issue and not Gluster itself, but I am confused why Gluster nodes #1 and #3 didn't complain about not seeing #2. If the networking did drop out, shouldn't they have noticed?
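One way to answer that is to look from the server side rather than the client side. A diagnostic sketch, run on nodes #1 and #3 (the volume name GL1image and brick path /GLUSTER/GL1image are taken from the logs above; the exact brick log filename on your boxes may differ):

```shell
# Did the bricks on the healthy nodes ever log the client on host #2 dropping?
# Brick logs are named after the brick path with slashes replaced by dashes.
grep -i disconnect /var/log/glusterfs/bricks/GLUSTER-GL1image.log | grep "2019-12-05 22:00"

# Current cluster health as seen from this node
gluster peer status
gluster volume status GL1image

# Which clients each brick has connected right now
gluster volume status GL1image clients
```

If the bricks never logged a disconnect for host #2's client at 22:00, that points at the failure being on host #2's side of the TCP sessions (or something between), rather than the bricks going away.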

There also don't appear to be any visible hard-disk issues (smartd is running).

Side note: I have reset the tcp timeout back to 42 seconds and will look at upgrading to 6.6. I also see that the arb and the unaffected Gluster node were running Gluster 6.4 (I don't know why #2 is on 6.5, but I am checking on that as well; we turn off auto-upgrade).
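To confirm the timeout change took effect and to map out the version skew, something like the following on each of the three nodes may help (GL1image is the volume name from the logs; the op-version check is only relevant if you bump it after upgrading):

```shell
# Installed Gluster version on this node
gluster --version

# Effective ping timeout for the volume (the upstream default is 42 seconds;
# the "21 seconds" in the log suggests it had been lowered on host #2)
gluster volume get GL1image network.ping-timeout

# Cluster-wide operating version vs. the maximum this node supports
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version
```

Running a mixed 6.4/6.5 cluster is generally tolerated within a minor series, but getting all three nodes onto the same build removes one variable.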

Maybe the mismatched versions are the culprit?

Also, we have a large number of these replica 2+1 Gluster setups running Gluster versions from 5.x up, and none of the others have had this issue.

Any advice would be appreciated.

Sincerely,

Wk





________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
