Hi Karthik,

Many thanks for the response!

On 4 July 2018 at 05:26, Karthik Subrahmanya <ksubr...@redhat.com> wrote:
> Hi,
>
> From the logs you have pasted it looks like those files are in GFID
> split-brain. They should have GFIDs assigned on both the data bricks,
> but the GFIDs will be different.
>
> Can you please paste the getfattr output of those files and their parent
> from all the bricks again?

The files don't have any of gluster's trusted.* attributes set; however, I
did manage to find their corresponding entries in .glusterfs.

==================================
[root@v0 .glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ea
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@v0 .glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

[root@v0 .glusterfs]# ls -l /gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee
==================================

Again, here are the relevant client log entries:

[2018-07-03 19:09:29.245089] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
[2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
[2018-07-03 19:09:30.619000] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
[2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)

[root@v0 .glusterfs]# find . -type f | grep -E "5e95ba8c-2f12-49bf-be2d-b4baf210d366|8e86902a-c31c-4990-b0c5-0318807edb8f|b9cd7613-3b96-415d-a549-1dc788a4f94d|e5899a4c-dc5d-487e-84b0-9bbc73133c25"
[root@v0 .glusterfs]#
==================================

> Which version of gluster you are using?

3.8.5. An upgrade is on the books; however, I had to roll back my last
attempt, as 3.12 didn't interoperate with 3.8 and I was unable to do a live
rolling upgrade. Once I've got this GFID mess sorted out, I'll give a full
upgrade a go, as I've already had to fail over this cluster's services to
another cluster.
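For reference, this is how I located that directory: a GFID maps onto the
brick path .glusterfs/<first two hex chars>/<next two>/<full gfid>, and the
trusted.gfid xattr is just the GFID with the dashes stripped. A quick sketch
of the mapping (the helper names are my own, not gluster tooling):

```shell
#!/bin/sh
# Sketch: map a GFID string to its backend entry under a brick's
# .glusterfs directory, and to the hex value stored in trusted.gfid.
# Helper names are illustrative only.

gfid_backend_path() {
    # .glusterfs/<chars 1-2>/<chars 3-4>/<full gfid>
    gfid="$1"
    printf '.glusterfs/%s/%s/%s\n' \
        "$(printf '%s' "$gfid" | cut -c1-2)" \
        "$(printf '%s' "$gfid" | cut -c3-4)" \
        "$gfid"
}

gfid_xattr_hex() {
    # trusted.gfid is the GFID's 16 bytes: strip dashes, prefix 0x.
    printf '0x%s\n' "$(printf '%s' "$1" | tr -d '-')"
}

gfid_backend_path "db9afb92-d2bc-49ed-8e34-dcd437ba7be2"
gfid_xattr_hex "db9afb92-d2bc-49ed-8e34-dcd437ba7be2"
```

That reproduces both the .glusterfs/db/9a/... path from the ls output above
and the trusted.gfid=0xdb9afb92... value from the parent's getfattr output.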
> If you are using a version higher than or equal to 3.12, GFID split-brains
> can be resolved using the methods (except method 4) explained in the
> "Resolution of split-brain using gluster CLI" section in [1].
> Also note that for GFID split-brain resolution using the CLI you have to
> pass the name of the file as the argument, not the GFID.
>
> If it is lower than 3.12 (please consider upgrading, since those versions
> are EOL), you have to resolve it manually as explained in [2]:
>
> [1] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
> [2] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/#dir-split-brain

"The user needs to remove either file '1' on brick-a or the file '1' on
brick-b to resolve the split-brain. In addition, the corresponding
gfid-link file also needs to be removed."

Okay, so as you can see above, the files don't have a trusted.gfid
attribute, and on the brick I didn't find any files in .glusterfs with the
same names as the GFIDs reported in the client log. I did, however, find
the symlinked files in a .glusterfs directory under the parent directory's
GFID:

[root@v0 .glusterfs]# ls -l /gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee

So if I delete those two symlinks and the files they point to, on one of
the two bricks, will that resolve the split-brain?
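In other words, something along the lines of this toy walk-through in a
scratch directory (the layout, file contents and GFID are mocked up from the
log entries above; on a real brick I'd touch only the bad copy, on one brick,
and only if the gfid hardlink actually exists, which in my case it apparently
doesn't):

```shell
#!/bin/sh
# Toy simulation of the manual cleanup from [2]: remove the bad copy of the
# file AND its gfid link under .glusterfs on ONE brick, then let self-heal
# recreate it from the good brick. All paths/GFIDs are stand-ins.
set -e

brick=$(mktemp -d)                            # pretend brick root
gfid="5e95ba8c-2f12-49bf-be2d-b4baf210d366"   # one GFID from the client log
d1=$(printf '%s' "$gfid" | cut -c1-2)
d2=$(printf '%s' "$gfid" | cut -c3-4)

# Fabricate the layout: a named file, hardlinked from .glusterfs/xx/yy/gfid
# (which is how a plain file's gfid entry is stored on a brick).
mkdir -p "$brick/ha_agent" "$brick/.glusterfs/$d1/$d2"
echo "stale data" > "$brick/ha_agent/hosted-engine.metadata"
ln "$brick/ha_agent/hosted-engine.metadata" "$brick/.glusterfs/$d1/$d2/$gfid"

# The actual fix: remove both links to the bad copy on the chosen brick.
rm "$brick/ha_agent/hosted-engine.metadata"
rm "$brick/.glusterfs/$d1/$d2/$gfid"

# After this, a lookup/heal from the mount should recreate the file from
# the surviving good copy.
ls -A "$brick/.glusterfs/$d1/$d2"
```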
> Thanks & Regards,
> Karthik
>
> On Wed, Jul 4, 2018 at 1:59 AM Gambit15 <dougti+glus...@gmail.com> wrote:
>
>> On 1 July 2018 at 22:37, Ashish Pandey <aspan...@redhat.com> wrote:
>>
>>> The only problem at the moment is that the arbiter brick is offline.
>>> You should only bother about completing the maintenance of the arbiter
>>> brick ASAP. Bring this brick UP, start a full heal or index heal, and
>>> the volume will be in a healthy state.
>>
>> Doesn't the arbiter only resolve split-brain situations? None of the
>> files that have been marked for healing are marked as in split-brain.
>>
>> The arbiter has now been brought back up; however, the problem continues.
>>
>> I've found the following information in the client log:
>>
>> [2018-07-03 19:09:29.245089] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
>> [2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
>> [2018-07-03 19:09:30.619000] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
>> [2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
>>
>> As you can see from the logs I posted previously, neither of those two
>> files, on either of the two servers, has any of gluster's extended
>> attributes set.
>>
>> The arbiter doesn't have any record of the files in question, as they
>> were created after it went offline.
>>
>> How do I fix this? Is it possible to locate the correct GFIDs somewhere
>> and redefine them on the files manually?
>>
>> Cheers,
>> Doug
>>
>>> ------------------------------
>>> *From: *"Gambit15" <dougti+glus...@gmail.com>
>>> *To: *"Ashish Pandey" <aspan...@redhat.com>
>>> *Cc: *"gluster-users" <gluster-users@gluster.org>
>>> *Sent: *Monday, July 2, 2018 1:45:01 AM
>>> *Subject: *Re: [Gluster-users] Files not healing & missing their extended attributes - Help!
>>>
>>> Hi Ashish,
>>>
>>> The output is below. It's a rep 2+1 volume. The arbiter is offline for
>>> maintenance at the moment; however, quorum is met and no files are
>>> reported as in split-brain (it hosts VMs, so files aren't accessed
>>> concurrently).
>>>
>>> ======================
>>> [root@v0 glusterfs]# gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: 279737d3-3e5a-4ee9-8d4a-97edcca42427
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: s0:/gluster/engine/brick
>>> Brick2: s1:/gluster/engine/brick
>>> Brick3: s2:/gluster/engine/arbiter (arbiter)
>>> Options Reconfigured:
>>> nfs.disable: on
>>> performance.readdir-ahead: on
>>> transport.address-family: inet
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.low-prio-threads: 32
>>>
>>> ======================
>>>
>>> [root@v0 glusterfs]# gluster volume heal engine info
>>> Brick s0:/gluster/engine/brick
>>> /__DIRECT_IO_TEST__
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9
>>> <LIST TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s1:/gluster/engine/brick
>>> <SAME AS ABOVE - TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s2:/gluster/engine/arbiter
>>> Status: Transport endpoint is not connected
>>> Number of entries: -
>>>
>>> ======================
>>> === PEER V0 ===
>>>
>>> [root@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024e8
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>
>>> [root@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>> === PEER V1 ===
>>>
>>> [root@v1 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024ec
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>
>>> ======================
>>>
>>> cmd_history.log-20180701:
>>>
>>> [2018-07-01 03:11:38.461175] : volume heal engine full : SUCCESS
>>> [2018-07-01 03:11:51.151891] : volume heal data full : SUCCESS
>>>
>>> glustershd.log-20180701:
>>> <LOGS FROM 06/01 TRUNCATED>
>>> [2018-07-01 07:15:04.779122] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
>>>
>>> glustershd.log:
>>> [2018-07-01 07:15:04.779693] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>
>>> That's the *only* message in glustershd.log today.
>>>
>>> ======================
>>>
>>> [root@v0 glusterfs]# gluster volume status engine
>>> Status of volume: engine
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick s0:/gluster/engine/brick              49154     0          Y       2816
>>> Brick s1:/gluster/engine/brick              49154     0          Y       3995
>>> Self-heal Daemon on localhost               N/A       N/A        Y       2919
>>> Self-heal Daemon on s1                      N/A       N/A        Y       4013
>>>
>>> Task Status of Volume engine
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> ======================
>>>
>>> Okay, so actually only the directory ha_agent is listed for healing
>>> (not its contents), and that does have attributes set.
>>>
>>> Many thanks for the reply!
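Incidentally, those trusted.afr.* values are three big-endian 32-bit
counters: data, metadata and entry pending operations. A quick sketch of
decoding them (the helper is my own, not gluster tooling):

```shell
#!/bin/sh
# Sketch: split a trusted.afr.* xattr value into its three big-endian
# 32-bit counters (data / metadata / entry pending operations).
decode_afr() {
    hex=${1#0x}                                          # strip the 0x prefix
    data=$((0x$(printf '%s' "$hex" | cut -c1-8)))        # bytes 1-4
    meta=$((0x$(printf '%s' "$hex" | cut -c9-16)))       # bytes 5-8
    entry=$((0x$(printf '%s' "$hex" | cut -c17-24)))     # bytes 9-12
    printf 'data=%d metadata=%d entry=%d\n' "$data" "$meta" "$entry"
}

decode_afr 0x0000000000000000000024ec
```

So the ha_agent values above (0x...24e8 on v0, 0x...24ec on v1) decode to
thousands of pending *entry* operations recorded against engine-client-2,
which I read as entry heals queued for the arbiter while it was offline.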
>>> On 1 July 2018 at 15:34, Ashish Pandey <aspan...@redhat.com> wrote:
>>>
>>>> You have not even talked about the volume type and configuration, and
>>>> this issue would require a lot of other information to fix. Please
>>>> provide:
>>>>
>>>> 1 - The type of volume and its config
>>>> 2 - The gluster v <volname> info output
>>>> 3 - The heal info output
>>>> 4 - The getfattr output of one of the files which needs healing, from
>>>>     all the bricks
>>>> 5 - What led to the healing of the file?
>>>> 6 - The gluster v <volname> status output
>>>> 7 - The glustershd.log output just after you run a full heal or index heal
>>>>
>>>> ----
>>>> Ashish
>>>>
>>>> ------------------------------
>>>> *From: *"Gambit15" <dougti+glus...@gmail.com>
>>>> *To: *"gluster-users" <gluster-users@gluster.org>
>>>> *Sent: *Sunday, July 1, 2018 11:50:16 PM
>>>> *Subject: *[Gluster-users] Files not healing & missing their extended attributes - Help!
>>>>
>>>> Hi Guys,
>>>>
>>>> I had to restart our datacenter yesterday, but since doing so a number
>>>> of the files on my gluster share have been stuck, marked as healing.
>>>> After no signs of progress, I manually set off a full heal last night,
>>>> but after 24 hrs nothing's happened.
>>>>
>>>> The gluster logs all look normal, and there are no messages about
>>>> failed connections or heal processes kicking off.
>>>>
>>>> I checked the listed files' extended attributes on their bricks today,
>>>> and they only show the selinux attribute. There are none of the
>>>> trusted.* attributes I'd expect. The healthy files on the bricks do
>>>> have their extended attributes, though.
>>>>
>>>> I'm guessing that perhaps the files somehow lost their attributes, and
>>>> gluster is no longer able to work out what to do with them? It's not
>>>> logged any errors, warnings, or anything else out of the ordinary
>>>> though, so I've no idea what the problem is or how to resolve it.
>>>>
>>>> I've got 16 hours to get this sorted before the start of work on
>>>> Monday. Help!
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users