On Wed, Aug 31, 2016 at 12:59 AM, Krutika Dhananjay <[email protected]> wrote:
> Tried this.
>
> With me, only 'fake2' gets healed after I bring the 'empty' brick back up, and it stops there unless I do a 'heal-full'.
>

When you say 'heal-full', is that a command I don't see when running gluster help, or are you just abbreviating the 'gluster v heal <> full' line?

> Is that what you're seeing as well?
>

Yes and no. Right now on my prod, even if I run heal <> full, nothing happens. My understanding is that a sweep should occur on each brick, but it only occurs on the down node, and then no shard healing occurs.

> -Krutika
>
> On Wed, Aug 31, 2016 at 4:43 AM, David Gossage <[email protected]> wrote:
>
>> Same issue: brought up glusterd on the problem node, and the heal count is still stuck at 6330.
>>
>> Ran gluster v heal GLUSTER1 full
>>
>> glustershd on the problem node shows a sweep starting and finishing in seconds. The other 2 nodes show no activity in their logs. They should start a sweep too, shouldn't they?
>>
>> Tried starting from scratch:
>>
>> kill -15 brickpid
>> rm -Rf /brick
>> mkdir -p /brick
>> mkdir /gsmount/fake2
>> setfattr -n "user.some-name" -v "some-value" /gsmount/fake2
>>
>> Heals the visible dirs instantly, then stops.
>>
>> gluster v heal GLUSTER1 full
>>
>> I see the sweep start on the problem node and end almost instantly. No files are added to the heal list, no files are healed, and there is no more logging.
>>
>> [2016-08-30 23:11:31.544331] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1
>> [2016-08-30 23:11:33.776235] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished full sweep on subvol GLUSTER1-client-1
>>
>> Same results no matter which node you run the command on. Still stuck with 6330 files showing as needing heal out of 19k, and the logs still show that no heals are occurring.
>>
>> Is there a way to forcibly reset any prior heal data? Could it be stuck on some past failed heal start?
>>
>> *David Gossage*
>> *Carousel Checks Inc. | System Administrator*
>> *Office* 708.613.2284
>>
>> On Tue, Aug 30, 2016 at 10:03 AM, David Gossage <[email protected]> wrote:
>>
>>> On Tue, Aug 30, 2016 at 10:02 AM, David Gossage <[email protected]> wrote:
>>>
>>>> updated test server to 3.8.3
>>>>
>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>> Options Reconfigured:
>>>> cluster.granular-entry-heal: on
>>>> performance.readdir-ahead: on
>>>> performance.read-ahead: off
>>>> nfs.disable: on
>>>> nfs.addr-namelookup: off
>>>> nfs.enable-ino32: off
>>>> cluster.background-self-heal-count: 16
>>>> cluster.self-heal-window-size: 1024
>>>> performance.quick-read: off
>>>> performance.io-cache: off
>>>> performance.stat-prefetch: off
>>>> cluster.eager-lock: enable
>>>> network.remote-dio: on
>>>> cluster.quorum-type: auto
>>>> cluster.server-quorum-type: server
>>>> storage.owner-gid: 36
>>>> storage.owner-uid: 36
>>>> server.allow-insecure: on
>>>> features.shard: on
>>>> features.shard-block-size: 64MB
>>>> performance.strict-o-direct: off
>>>> cluster.locking-scheme: granular
>>>>
>>>> kill -15 brickpid
>>>> rm -Rf /gluster2/brick3
>>>> mkdir -p /gluster2/brick3/1
>>>> mkdir /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/fake2
>>>> setfattr -n "user.some-name" -v "some-value" /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/fake2
>>>> gluster v start glustershard force
>>>>
>>>> At this point the brick process starts and all visible files, including the new dir, are created on the brick. A handful of shards are still in the heal statistics, but no .shard directory is created and there is no increase in the shard count.
>>>>
>>>> gluster v heal glustershard
>>>>
>>>> At this point there is still no increase in the count, no directory is made, and no additional healing activity shows up in the logs. Waited a few minutes tailing logs to check if anything kicked in.
>>>>
>>>> gluster v heal glustershard full
>>>>
>>>> Shards are added to the list and healing commences. Logs show a full sweep starting on all 3 nodes, though this time it only shows as finishing on one, which looks to be the one that had its brick deleted.
>>>>
>>>> [2016-08-30 14:45:33.098589] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-0
>>>> [2016-08-30 14:45:33.099492] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-1
>>>> [2016-08-30 14:45:33.100093] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-2
>>>> [2016-08-30 14:52:29.760213] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: finished full sweep on subvol glustershard-client-2
>>>
>>> Just realized it's still healing, so that may be why the sweeps on the 2 other bricks haven't reported as finished.
>>>
>>>> My hope is that later tonight a full heal will work on production. Is it possible the self-heal daemon can get stale or stop listening but still show as active? Would stopping and starting the self-heal daemon from the gluster CLI before doing these heals be helpful?
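
As a side note, one way to confirm whether every node's self-heal daemon actually started (and finished) a full sweep is to grep each node's glustershd log and check the pending-heal backlog. A minimal sketch, assuming the default glustershd log location and the prod hostnames and volume name mentioned in this thread (adjust to your own environment):

    # Every replica should log both a "starting full sweep" and a
    # "finished full sweep" line for its local subvolume.
    for host in ccgl1.gl.local ccgl2.gl.local ccgl4.gl.local; do
        echo "== $host =="
        ssh "$host" 'grep "full sweep" /var/log/glusterfs/glustershd.log | tail -n 4'
    done

    # Pending-heal backlog per brick while the crawl runs.
    gluster volume heal GLUSTER1 statistics heal-count
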
>>>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage <[email protected]> wrote:
>>>>
>>>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Could you also share the glustershd logs?
>>>>>>>>>
>>>>>>>>> I'll get them when I get to work, sure.
>>>>>>>>>
>>>>>>>>>> I tried the same steps that you mentioned multiple times, but heal is running to completion without any issues.
>>>>>>>>>>
>>>>>>>>>> It must be said that 'heal full' traverses the files and directories in a depth-first order and does the heals in the same order. But if it gets interrupted in the middle (say because self-heal-daemon was either intentionally or unintentionally brought offline and then brought back up), self-heal will only pick up the entries that are so far marked as new entries that need heal, which it will find in the indices/xattrop directory. What this means is that those files and directories that were not visited during the crawl will remain untouched and unhealed in this second iteration of heal, unless you execute a 'heal full' again.
>>>>>>>>>
>>>>>>>>> So should it start healing shards as it crawls, or not until after it crawls the entire .shard directory? At the pace it was going, that could be a week, with one node appearing in the cluster but having no shard files if anything tries to access a file on that node. From my experience the other day, telling it to heal full again did nothing, regardless of which node was used.
>>>>>>>
>>>>>>> Crawl is started from '/' of the volume. Whenever self-heal detects during the crawl that a file or directory is present in some brick(s) and absent in others, it creates the file on the bricks where it is absent and marks the fact that the file or directory might need data/entry and metadata heal too (this also means that an index is created under .glusterfs/indices/xattrop of the src bricks). And the data/entry and metadata heals are picked up and done in the background with the help of these indices.
>>>>>>
>>>>>> Looking at my 3rd node as an example, I find nearly the exact same number of files in the xattrop dir as reported by heal count at the time I brought down node2 to try and alleviate the read IO errors that seemed to occur from what I was guessing were attempts to use the node with no shards for reads.
>>>>>>
>>>>>> Also attached are the glustershd logs from the 3 nodes, along with the test node I tried yesterday with the same results.
>>>>>
>>>>> Looking at my own logs, I notice that a full sweep was only ever recorded in glustershd.log on the 2nd node with the missing directory. I believe I should have found a sweep begun on every node, correct?
>>>>>
>>>>> On my test dev, when it did work, I do see that:
>>>>>
>>>>> [2016-08-30 13:56:25.223333] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-0
>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-1
>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-2
>>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: finished full sweep on subvol glustershard-client-2
>>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: finished full sweep on subvol glustershard-client-1
>>>>> [2016-08-30 14:18:49.637811] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: finished full sweep on subvol glustershard-client-0
>>>>>
>>>>> While looking at the past few days on the 3 prod nodes, I only found this on my 2nd node:
>>>>>
>>>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 21:00:42.603315] I [MSGID: 108026] [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026] [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished full sweep on subvol GLUSTER1-client-1
>>>>>
>>>>>>>>>> My suspicion is that this is what happened on your setup. Could you confirm if that was the case?
>>>>>>>>>
>>>>>>>>> The brick was brought online with force start, then a full heal was launched. Hours later, after it became evident that it was not adding new files to heal, I did try restarting the self-heal daemon and relaunching the full heal again. But this was after the heal had basically already failed to work as intended.
>>>>>>>>
>>>>>>>> OK. How did you figure it was not adding any new files? I need to know what places you were monitoring to come to this conclusion.
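
On the question of what to monitor: a rough way to compare what the heal commands report with the pending-heal indices on a brick is sketched below. This is only a sketch; it assumes the prod brick path quoted in this thread, and the naming of entries under .glusterfs/indices/xattrop is a Gluster internal that may differ between versions (the base 'xattrop-<uuid>' link is just bookkeeping, so it is filtered out of the count):

    # Count pending-heal index entries on this brick (path is an example).
    BRICK=/gluster1/BRICK1/1
    ls "$BRICK/.glusterfs/indices/xattrop" | grep -v '^xattrop' | wc -l

    # Compare with what the self-heal daemon reports per brick.
    gluster volume heal GLUSTER1 info | grep "Number of entries"
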
>>>>>>>>
>>>>>>>> -Krutika
>>>>>>>>
>>>>>>>>>> As for those logs, I did manage to do something that caused these warning messages you shared earlier to appear in my client and server logs. Although these logs are annoying and a bit scary too, they didn't do any harm to the data in my volume. Why they appear just after a brick is replaced, and under no other circumstances, is something I'm still investigating.
>>>>>>>>>>
>>>>>>>>>> But for the future, it would be good to follow the steps Anuradha gave, as that would allow self-heal to at least detect that it has some repairing to do whenever it is restarted, whether intentionally or otherwise.
>>>>>>>>>
>>>>>>>>> I followed those steps as described on my test box and ended up with the exact same outcome: shards added at an agonizingly slow pace, and no creation of the .shard directory or heals on the shard directory. Directories visible from the mount healed quickly. This was with one VM, so it has only 800 shards as well. After hours at work it had added a total of 33 shards to be healed. I sent those logs yesterday as well, though not the glustershd logs.
>>>>>>>>>
>>>>>>>>> Does the replace-brick command copy files in the same manner? For these purposes I am contemplating just skipping the heal route.
>>>>>>>>>
>>>>>>>>>> -Krutika
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> attached brick and client logs from test machine where the same behavior occurred; not sure if anything new is there. It's still on 3.8.2.
>>>>>>>>>>>
>>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>> Bricks:
>>>>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>> cluster.locking-scheme: granular
>>>>>>>>>>> performance.strict-o-direct: off
>>>>>>>>>>> features.shard-block-size: 64MB
>>>>>>>>>>> features.shard: on
>>>>>>>>>>> server.allow-insecure: on
>>>>>>>>>>> storage.owner-uid: 36
>>>>>>>>>>> storage.owner-gid: 36
>>>>>>>>>>> cluster.server-quorum-type: server
>>>>>>>>>>> cluster.quorum-type: auto
>>>>>>>>>>> network.remote-dio: on
>>>>>>>>>>> cluster.eager-lock: enable
>>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>>> performance.io-cache: off
>>>>>>>>>>> performance.quick-read: off
>>>>>>>>>>> cluster.self-heal-window-size: 1024
>>>>>>>>>>> cluster.background-self-heal-count: 16
>>>>>>>>>>> nfs.enable-ino32: off
>>>>>>>>>>> nfs.addr-namelookup: off
>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>> cluster.granular-entry-heal: on
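
To watch whether the heal queue is actually growing or draining over time, rather than eyeballing the logs, a minimal monitoring loop can help. A sketch, assuming the test volume name used above (swap in your own volume and interval):

    # Print a timestamped heal backlog every minute.
    while true; do
        date
        gluster volume heal glustershard statistics heal-count
        sleep 60
    done
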
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>> > From: "David Gossage" <[email protected]>
>>>>>>>>>>>>> > To: "Anuradha Talur" <[email protected]>
>>>>>>>>>>>>> > Cc: "[email protected] List" <[email protected]>, "Krutika Dhananjay" <[email protected]>
>>>>>>>>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>>>>>>>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <[email protected]> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > Response inline.
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > ----- Original Message -----
>>>>>>>>>>>>> > > > From: "Krutika Dhananjay" <[email protected]>
>>>>>>>>>>>>> > > > To: "David Gossage" <[email protected]>
>>>>>>>>>>>>> > > > Cc: "[email protected] List" <[email protected]>
>>>>>>>>>>>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM
>>>>>>>>>>>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Could you attach both client and brick logs? Meanwhile I will try these steps out on my machines and see if it is easily recreatable.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > -Krutika
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <[email protected]> wrote:
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Centos 7 Gluster 3.8.3
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>>> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>>> > > > Options Reconfigured:
>>>>>>>>>>>>> > > > cluster.data-self-heal-algorithm: full
>>>>>>>>>>>>> > > > cluster.self-heal-daemon: on
>>>>>>>>>>>>> > > > cluster.locking-scheme: granular
>>>>>>>>>>>>> > > > features.shard-block-size: 64MB
>>>>>>>>>>>>> > > > features.shard: on
>>>>>>>>>>>>> > > > performance.readdir-ahead: on
>>>>>>>>>>>>> > > > storage.owner-uid: 36
>>>>>>>>>>>>> > > > storage.owner-gid: 36
>>>>>>>>>>>>> > > > performance.quick-read: off
>>>>>>>>>>>>> > > > performance.read-ahead: off
>>>>>>>>>>>>> > > > performance.io-cache: off
>>>>>>>>>>>>> > > > performance.stat-prefetch: on
>>>>>>>>>>>>> > > > cluster.eager-lock: enable
>>>>>>>>>>>>> > > > network.remote-dio: enable
>>>>>>>>>>>>> > > > cluster.quorum-type: auto
>>>>>>>>>>>>> > > > cluster.server-quorum-type: server
>>>>>>>>>>>>> > > > server.allow-insecure: on
>>>>>>>>>>>>> > > > cluster.self-heal-window-size: 1024
>>>>>>>>>>>>> > > > cluster.background-self-heal-count: 16
>>>>>>>>>>>>> > > > performance.strict-write-ordering: off
>>>>>>>>>>>>> > > > nfs.disable: on
>>>>>>>>>>>>> > > > nfs.addr-namelookup: off
>>>>>>>>>>>>> > > > nfs.enable-ino32: off
>>>>>>>>>>>>> > > > cluster.granular-entry-heal: on
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Friday did rolling upgrade from 3.8.3->3.8.3 with no issues.
>>>>>>>>>>>>> > > > Following the steps detailed in previous recommendations, I began the process of replacing and healing bricks one node at a time.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > 1) kill pid of brick
>>>>>>>>>>>>> > > > 2) reconfigure brick from raid6 to raid10
>>>>>>>>>>>>> > > > 3) recreate directory of brick
>>>>>>>>>>>>> > > > 4) gluster volume start <> force
>>>>>>>>>>>>> > > > 5) gluster volume heal <> full
>>>>>>>>>>>>> > > Hi,
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > I'd suggest that full heal is not used. There are a few bugs in full heal. Better safe than sorry ;)
>>>>>>>>>>>>> > > Instead I'd suggest the following steps:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Currently I brought the node down by systemctl stop glusterd, as I was getting sporadic IO issues and a few VMs paused, so hoping that will help. I may wait to do this till around 4PM when most work is done, in case it shoots the load up.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > 1) kill pid of brick
>>>>>>>>>>>>> > > 2) do the configuring of the brick that you need
>>>>>>>>>>>>> > > 3) recreate brick dir
>>>>>>>>>>>>> > > 4) while the brick is still down, from the mount point:
>>>>>>>>>>>>> > > a) create a dummy non existent dir under / of mount.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > So if node 2 is the down brick, do I pick, for example, node 3 and make a test dir under its brick directory that doesn't exist on 2, or should I be doing this over a gluster mount?
>>>>>>>>>>>>> You should be doing this over gluster mount.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > b) set a non existent extended attribute on / of mount.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Could you give me an example of an attribute to set? I've read a tad on this, and looked up attributes, but haven't set any yet myself.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> Sure. setfattr -n "user.some-name" -v "some-value" <path-to-mount>
>>>>>>>>>>>>> > > Doing these steps will ensure that heal happens only from the updated brick to the down brick.
>>>>>>>>>>>>> > > 5) gluster v start <> force
>>>>>>>>>>>>> > > 6) gluster v heal <>
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Will it matter if somewhere in gluster the full heal command was run the other day? Not sure if it eventually stops or times out.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> full heal will stop once the crawl is done. So if you want to trigger heal again, run gluster v heal <>. Actually even brick up or volume start force should trigger the heal.
>>>>>>>>>>>>
>>>>>>>>>>>> Did this on the test bed today. It's one server with 3 bricks on the same machine, so take that for what it's worth. Also it still runs 3.8.2. Maybe I'll update and re-run the test.
>>>>>>>>>>>>
>>>>>>>>>>>> killed brick
>>>>>>>>>>>> deleted brick dir
>>>>>>>>>>>> recreated brick dir
>>>>>>>>>>>> created fake dir on gluster mount
>>>>>>>>>>>> set suggested fake attribute on it
>>>>>>>>>>>> ran volume start <> force
>>>>>>>>>>>>
>>>>>>>>>>>> Looked at the files it said needed healing, and it was just 8 shards that were modified during the few minutes I ran through the steps.
>>>>>>>>>>>>
>>>>>>>>>>>> Gave it a few minutes and it stayed the same, so I ran gluster volume <> heal.
>>>>>>>>>>>>
>>>>>>>>>>>> It healed all the directories and files you can see over the mount, including fakedir.
>>>>>>>>>>>>
>>>>>>>>>>>> Same issue for shards though: it adds more shards to heal at a glacier pace. Slight jump in speed if I stat every file and dir in the running VM, but not all shards.
>>>>>>>>>>>>
>>>>>>>>>>>> It started with 8 shards to heal and is now only at 33 out of 800, and probably won't finish adding for a few days at the rate it goes.
>>>>>>>>>>>>
>>>>>>>>>>>>> > > > 1st node worked as expected, took 12 hours to heal 1TB of data. Load was a little heavy but nothing shocking.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > About an hour after node 1 finished I began the same process on node2. The heal process kicked in as before and the files in directories visible from the mount and .glusterfs healed in a short time. Then it began the crawl of .shard, adding those files to the heal count, at which point the entire process basically ground to a halt. After 48 hours, out of 19k shards it has added 5900 to the heal list. Load on all 3 machines is negligible. It was suggested to change cluster.data-self-heal-algorithm to full and restart the volume, which I did. No effect.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Tried relaunching heal, no effect, no matter which node was picked. I started each VM and performed a stat of all files from within it, or a full virus scan, and that seemed to cause short small spikes in shards added, but not by much. Logs are showing no real messages indicating anything is going on. I get hits in the brick log on occasion of null lookups, making me think it's not really crawling the shards directory but waiting for a shard lookup to add it. I'll get the following in the brick log, but not constantly, and sometimes multiple times for the same shard.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > [2016-08-29 08:31:57.478125] W [MSGID: 115009] [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type for (null) (LOOKUP)
>>>>>>>>>>>>> > > > [2016-08-29 08:31:57.478170] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid argument) [Invalid argument]
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > This one repeated about 30 times in a row, then nothing for 10 minutes, then one hit for one different shard by itself.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > How can I determine if heal is actually running? How can I kill it or force a restart? Does the node I start it from determine which directory gets crawled to determine heals?
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > David Gossage
>>>>>>>>>>>>> > > > Carousel Checks Inc. | System Administrator
>>>>>>>>>>>>> > > > Office 708.613.2284
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > _______________________________________________
>>>>>>>>>>>>> > > > Gluster-users mailing list
>>>>>>>>>>>>> > > > [email protected]
>>>>>>>>>>>>> > > > http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > --
>>>>>>>>>>>>> > > Thanks,
>>>>>>>>>>>>> > > Anuradha.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anuradha.
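
For anyone following the same path, here is a minimal consolidated sketch of the brick-reset procedure Anuradha describes above. The brick path, mount point, volume name, and pid below are placeholders loosely taken from this thread; the dummy directory name and xattr are arbitrary, and the mount must be a FUSE mount of the volume:

    # 1) Stop only the brick process being rebuilt (find its pid with
    #    'gluster volume status GLUSTER1').
    kill -15 <brick-pid>

    # 2) Reconfigure the underlying storage as needed, then recreate an
    #    empty brick directory.
    rm -rf /gluster1/BRICK1/1
    mkdir -p /gluster1/BRICK1/1

    # 3) While the brick is still down, from a FUSE mount of the volume,
    #    create a dummy directory and set a dummy xattr on / of the mount
    #    so the good bricks record pending heals for the root.
    mkdir /mnt/glustermount/fake-heal-trigger
    setfattr -n "user.some-name" -v "some-value" /mnt/glustermount

    # 4) Bring the brick back and trigger index heal (not 'full').
    gluster volume start GLUSTER1 force
    gluster volume heal GLUSTER1
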
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
