Hi! The workload is VMs based on KVM/qemu, managed by libvirtd. I figured I could check the heal status by comparing the bricks: nothing already on the volume was healed, but newly created files were replicated (after a long delay of about 5 minutes). So I wanted to see whether existing files (the VM images) would be healed if I stopped a VM (closing any open handle on the file), which turned out not to be the case.
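Such a brick comparison can be sketched as follows (a sketch only — the use of `diff -qr` and the exclusion of the `.glusterfs` metadata directory are assumptions, not necessarily what was run here; it compares names and contents, not xattrs, so it is a heuristic rather than a substitute for `heal info`):

```shell
#!/bin/sh
# Sketch: compare two replica bricks to spot files that were not
# replicated. The .glusterfs metadata directory is excluded because
# its contents legitimately differ between bricks. Both brick paths
# must be visible on the machine running the comparison.
compare_bricks() {
    diff -qr -x .glusterfs "$1" "$2"
}
```

On this volume that would be something like `compare_bricks /storage/brick1/brick1 /storage/brick2/brick2`, with the remote brick mounted locally or mirrored over SSH first.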
I ended up shutting down all VMs and restarting the server. Afterwards healing worked as expected.

- Andreas

On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <[email protected]> wrote:
>
> ----- Original Message -----
> > From: "Andreas Mather" <[email protected]>
> > To: "Anuradha Talur" <[email protected]>
> > Cc: "[email protected] List" <[email protected]>
> > Sent: Thursday, September 24, 2015 6:59:38 PM
> > Subject: Re: [Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status
> >
> > Hi Anuradha!
> >
> > Thanks for your reply! Attached you can find the dump files. As I'm not
> > sure if they make their way through as attachments, here are links to
> > them as well:
> >
> > brick1 - http://pastebin.com/3ivkhuRH
> > brick2 - http://pastebin.com/77sT1mut
>
> Hi,
>
> I see some blocked locks in the statedumps.
> Could you let me know what kind of workload you had when you observed the hang?
>
> > - Andreas
> >
> > On Thu, Sep 24, 2015 at 3:18 PM, Anuradha Talur <[email protected]> wrote:
> > >
> > > ----- Original Message -----
> > > > From: "Andreas Mather" <[email protected]>
> > > > To: "[email protected] List" <[email protected]>
> > > > Sent: Thursday, September 24, 2015 1:24:12 PM
> > > > Subject: [Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status
> > > >
> > > > Hi!
> > > >
> > > > Our provider had network maintenance this night, so 2 of our 4 servers
> > > > got disconnected and reconnected. Since we knew this was coming, we
> > > > shifted all workload off the affected servers. This morning, most of
> > > > the cluster seems fine, but for one volume no heal info can be
> > > > retrieved, so we basically don't know the healing state of the volume.
> > > > The volume is a replica 2 volume between vhost4-int/brick1 and
> > > > vhost3-int/brick2.
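The hang reported below — `gluster volume heal vol4 info` never returning — can at least be bounded with a timeout so that monitoring does not block forever (a sketch; the `timeout` wrapper, the 30-second default, and the exit-code handling are assumptions, not part of the thread):

```shell
#!/bin/sh
# Sketch: run a possibly-hanging "heal info" query with a time limit.
# GNU coreutils timeout exits with status 124 if it killed the command.
heal_info_bounded() {
    vol="$1"
    limit="${2:-30}"   # seconds; 30 is an arbitrary default
    timeout "$limit" gluster volume heal "$vol" info
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "heal info on '$vol' hung for ${limit}s; consider collecting a statedump" >&2
    fi
    return "$status"
}
```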
> > > >
> > > > The volume is accessible, but since I don't get any heal info, I don't
> > > > know if it is properly replicated. Any help to resolve this situation
> > > > is highly appreciated.
> > > >
> > > > hangs forever:
> > > > [root@vhost4 ~]# gluster volume heal vol4 info
> > > >
> > > > glfsheal-vol4.log:
> > > > [2015-09-24 07:47:59.284723] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > > > [2015-09-24 07:47:59.293735] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > > > [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify] 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> > > > [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-1: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-2: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-1: changing port to 49155 (from 0)
> > > > [2015-09-24 07:47:59.315958] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.316481] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1: Connected to vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> > > > [2015-09-24 07:47:59.316495] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.316538] I [MSGID: 108005] [afr-common.c:3960:afr_notify] 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> > > > [2015-09-24 07:47:59.317150] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1: Server lk version = 1
> > > > [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-2: changing port to 49154 (from 0)
> > > > [2015-09-24 07:47:59.325633] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.325780] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2: Connected to vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> > > > [2015-09-24 07:47:59.325791] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.333346] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2: Server lk version = 1
> > > > [2015-09-24 07:47:59.334545] I [MSGID: 108031] [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting local read_child vol4-client-2
> > > > [2015-09-24 07:47:59.335833] I [MSGID: 104041] [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph 76686f73-7434-2e61-6c6c-61626f757461 (0)
> > > >
> > > > Questions about this output:
> > > > -) Why does it report "Using Program GlusterFS 3.3, Num (1298437), Version (330)"? We run 3.7.3?!
> > > > -) gluster logs timestamps in UTC, not taking the server timezone into account. Is there a way to fix this?
> > > >
> > > > etc-glusterfs-glusterd.vol.log:
> > > > no log entries after the volume heal info command
> > > >
> > > > storage-brick1-brick1.log:
> > > > [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> > > > [2015-09-24 07:47:59.325743] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client from vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0 (version: 3.7.3)
> > > >
> > > > storage-brick2-brick2.log:
> > > > no log entries after the volume heal info command
> > >
> > > Hi Andreas,
> > >
> > > Could you please provide the following information so that we can
> > > understand why the command is hanging?
> > > When the command is hung, run the following command from one of the servers:
> > > `gluster volume statedump <volname>`
> > > This command will generate statedumps of the glusterfsd processes on the
> > > servers. You can find them at /var/run/gluster. A typical statedump for
> > > a brick has "<brick-path>.<pid-of-brick>.dump.<timestamp>" as its name.
> > > Could you please attach them and respond?
>
> > > > Thanks,
> > > >
> > > > - Andreas
> > > >
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > [email protected]
> > > > http://www.gluster.org/mailman/listinfo/gluster-users
> > >
> > > --
> > > Thanks,
> > > Anuradha.
> >
>
> --
> Thanks,
> Anuradha.
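Anuradha's statedump steps above can be scripted roughly as follows (a sketch — only the `gluster volume statedump` command, the /var/run/gluster location, and the dump-file naming come from her mail; the wrapper itself is an assumption):

```shell
#!/bin/sh
# Sketch: trigger statedumps for a volume's brick processes and list
# the resulting dump files, newest first. Run on one of the servers
# while the heal-info command is hung. The default dump directory
# /var/run/gluster is taken from the mail above.
collect_statedumps() {
    vol="$1"
    dumpdir="${2:-/var/run/gluster}"
    gluster volume statedump "$vol" || return 1
    # Dump files are named <brick-path>.<pid-of-brick>.dump.<timestamp>.
    ls -t "$dumpdir"/*.dump.* 2>/dev/null
}
```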
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
