On Tue, Jan 29, 2019 at 12:11 PM Zhou, Cynthia (NSB - CN/Hangzhou) <[email protected]> wrote:
> Hi,
>
> We are using glusterfs version 3.12 with 3 bricks. I find that occasionally,
> after rebooting all 3 SN nodes simultaneously, the glusterd process on one SN
> node may get stuck: when you try to execute a glusterd command, it does not
> respond.
>
> Following are the log of the stuck glusterd and the gdb info of the stuck
> glusterd process:
>
> [2019-01-28 14:38:24.999329] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
> [2019-01-28 14:38:24.999450] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
> [2019-01-28 14:38:24.999597] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a4315121-e127-42ca-9869-0fe451216a80
> [2019-01-28 14:38:24.999692] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e, host: mn-0.local, port: 0
> [2019-01-28 14:38:25.010624] W [socket.c:593:__socket_rwv] 0-management: readv on 192.168.1.14:24007 failed (No data available)
> [2019-01-28 14:38:25.010774] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <mn-0.local> (<7f694d7e-7613-4298-8da1-50cbb73ed47e>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2019-01-28 14:38:25.010860] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol ccs not held
> [2019-01-28 14:38:25.010903] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for ccs
> [2019-01-28 14:38:25.010931] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol encryptfile not held
> [2019-01-28 14:38:25.010945] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for encryptfile
> [2019-01-28 14:38:25.010971] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol export not held
> [2019-01-28 14:38:25.010983] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for export
> [2019-01-28 14:38:25.011006] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol log not held
> [2019-01-28 14:38:25.011046] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for log
> [2019-01-28 14:38:25.011070] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol mstate not held
> [2019-01-28 14:38:25.011082] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for mstate
> [2019-01-28 14:38:25.011104] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol services not held
> [2019-01-28 14:38:25.011115] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for services
> [2019-01-28 14:38:25.011268] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f182df24a99] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1f9)[0x7f182dce6f27] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7f182dce701a] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7f182dce756e] (--> /lib64/libgfrpc.so.0(+0x12f9e)[0x7f182dce7f9e] ))))) 0-management: forced unwinding frame type(Peer mgmt) op(--(4)) called at 2019-01-28 14:38:24.999851 (xid=0x7)
> [2019-01-28 14:38:25.011286] E [MSGID: 106158] [glusterd-rpc-ops.c:684:__glusterd_friend_update_cbk] 0-management: RPC Error
> [2019-01-28 14:38:25.011301] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received RJT from uuid: 00000000-0000-0000-0000-000000000000
> [2019-01-28 14:38:25.011424] I [MSGID: 106006] [glusterd-svc-mgmt.c:328:glusterd_svc_common_rpc_notify] 0-management: glustershd has connected with glusterd.
> [2019-01-28 14:38:25.011599] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /mnt/bricks/ccs/brick on port 53952
> [2019-01-28 14:38:25.011862] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e
> [2019-01-28 14:38:25.021058] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
> [2019-01-28 14:38:25.021090] I [socket.c:3704:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
> [2019-01-28 14:38:25.021099] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)
> [2019-01-28 14:38:25.021108] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed
> [2019-01-28 14:38:25.021126] E [rpcsvc.c:559:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
> [2019-01-28 14:38:25.021135] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)
> [2019-01-28 14:38:25.021147] W [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: failed to queue error reply
>
> Gdb info of the stuck glusterd threads:
>
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7f1826ec2700 (LWP 2418))]
> #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
> #1 0x00007f182ccea677 in __lll_lock_elision () from /lib64/libpthread.so.0
> #2 0x00007f182df5cae6 in iobref_unref () from /lib64/libglusterfs.so.0
> #3 0x00007f182dce2f29 in rpc_transport_pollin_destroy () from /lib64/libgfrpc.so.0
> #4 0x00007f1827ccf319 in socket_event_poll_in () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
> #5 0x00007f1827ccf932 in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
> #6 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0
> #7 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
> #8 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0
> #9 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6
> (gdb) thread 9
> [Switching to thread 9 (Thread 0x7f18266c1700 (LWP 2419))]
> #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
> #1 0x00007f182cce2b42 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
> #2 0x00007f182cce44c8 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #3 0x00007f1827ccadab in socket_event_poll_err () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
> #4 0x00007f1827ccf99c in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
> #5 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0
> #6 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
> #7 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0
> #8 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6
>
> Have you ever encountered this issue? From the gdb info, it seems the epoll
> threads are deadlocked.

It's not a deadlock. POLLERR handling is serialized with any ongoing POLLIN/POLLOUT request processing, which is why thread 9 is waiting in pthread_cond_wait. The real problem is that iobref_unref is waiting on a lock, and hence the POLLIN processing didn't complete. The question is why iobref_unref is stuck: who holds the iobref's lock?
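For anyone following along, here is a minimal C sketch of the serialization being described. This is illustrative only, not the actual socket.c source: the names notify_lock, notify_cond, in_progress and the stub functions are assumptions made for the example, chosen to line up with the two backtraces above.

/*
 * Sketch: POLLERR handling serialized behind in-flight POLLIN handling.
 * All names here are illustrative assumptions, not GlusterFS internals.
 */
#include <pthread.h>

static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  notify_cond = PTHREAD_COND_INITIALIZER;
static int             in_progress;  /* POLLIN/POLLOUT handlers in flight */

/* Stubs standing in for the real work. */
static void process_rpc_message(void *msg) { (void)msg; /* dispatch the request */ }
static void pollin_destroy(void *msg)      { (void)msg; /* iobref_unref() happens here */ }
static void teardown_connection(void)      { /* close socket, unwind saved frames */ }

/* POLLIN path -- corresponds to thread 8 in the backtrace above. */
static void handle_pollin(void *msg)
{
    pthread_mutex_lock(&notify_lock);
    in_progress++;
    pthread_mutex_unlock(&notify_lock);

    process_rpc_message(msg);
    pollin_destroy(msg);  /* thread 8 is blocked in here, inside
                             iobref_unref(), waiting for the iobref's lock */

    pthread_mutex_lock(&notify_lock);
    in_progress--;
    pthread_cond_broadcast(&notify_cond);  /* wake any waiting POLLERR handler */
    pthread_mutex_unlock(&notify_lock);
}

/* POLLERR path -- corresponds to thread 9 in the backtrace above. */
static void handle_pollerr(void)
{
    pthread_mutex_lock(&notify_lock);
    while (in_progress > 0)  /* serialize behind in-flight POLLIN/POLLOUT */
        pthread_cond_wait(&notify_cond, &notify_lock);
    pthread_mutex_unlock(&notify_lock);

    teardown_connection();  /* safe: no handler is touching the transport */
}

In that picture thread 9's pthread_cond_wait is expected behavior, and the mutex worth chasing is the one iobref_unref is blocked on in thread 8. With glibc, a contended pthread_mutex_t normally records the LWP of its owner in __data.__owner, so with libglusterfs debuginfo installed something along these lines should identify the holder (the iobref field names are assumptions about the structure):

(gdb) thread 8
(gdb) frame 2                             # the iobref_unref frame
(gdb) print iobref->lock.__data.__owner   # LWP of the thread holding the lock
(gdb) info threads                        # map that LWP to a thread, then bt it

One caveat: the __lll_lock_elision frame in thread 8 suggests glibc lock elision is in use on this system, in which case the owner field may not always be populated.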
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
