Tried this. With me, only 'fake2' gets healed after I bring the 'empty' brick back up, and it stops there unless I do a 'heal full'.
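For reference, this is roughly the sequence I ran, pieced together from the steps quoted below (the brick path, mount point and volume name here are placeholders for my test setup, not your production values):

kill -15 <brick-pid>                      # stop just the brick process
rm -rf /bricks/testvol/brick1             # wipe the old brick directory
mkdir -p /bricks/testvol/brick1           # recreate it empty
mkdir /mnt/testvol/fake2                  # dummy dir, created from the fuse mount (not on the brick)
setfattr -n "user.some-name" -v "some-value" /mnt/testvol/fake2
gluster volume start testvol force        # bring the replaced brick back up
gluster volume heal testvol               # index heal; 'gluster volume heal testvol full' forces a full crawl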
Is that what you're seeing as well? -Krutika On Wed, Aug 31, 2016 at 4:43 AM, David Gossage <[email protected]> wrote: > Same issue: brought up glusterd on problem node, heal count still stuck at > 6330. > > Ran gluster v heal GLUSTER1 full > > glustershd on problem node shows a sweep starting and finishing in > seconds. Other 2 nodes show no activity in log. They should start a sweep > too, shouldn't they? > > Tried starting from scratch: > > kill -15 brickpid > rm -Rf /brick > mkdir -p /brick > mkdir /gsmount/fake2 > setfattr -n "user.some-name" -v "some-value" /gsmount/fake2 > > Heals visible dirs instantly, then stops. > > gluster v heal GLUSTER1 full > > See sweep start on problem node and end almost instantly. No files added to > heal list, no files healed, no more logging. > > [2016-08-30 23:11:31.544331] I [MSGID: 108026] > [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: > starting full sweep on subvol GLUSTER1-client-1 > [2016-08-30 23:11:33.776235] I [MSGID: 108026] > [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: > finished full sweep on subvol GLUSTER1-client-1 > > Same results no matter which node you run the command on. Still stuck with > 6330 files showing as needing heal out of 19k. Logs still show no > heals occurring. > > Is there a way to forcibly reset any prior heal data? Could it be stuck > on some past failed heal start? > > > > > *David Gossage* > *Carousel Checks Inc. | System Administrator* > *Office* 708.613.2284 > > On Tue, Aug 30, 2016 at 10:03 AM, David Gossage < > [email protected]> wrote: > >> On Tue, Aug 30, 2016 at 10:02 AM, David Gossage < >> [email protected]> wrote: >> >>> Updated test server to 3.8.3 >>> >>> Brick1: 192.168.71.10:/gluster2/brick1/1 >>> Brick2: 192.168.71.11:/gluster2/brick2/1 >>> Brick3: 192.168.71.12:/gluster2/brick3/1 >>> Options Reconfigured: >>> cluster.granular-entry-heal: on >>> performance.readdir-ahead: on >>> performance.read-ahead: off >>> nfs.disable: on >>> nfs.addr-namelookup: off >>> nfs.enable-ino32: off >>> cluster.background-self-heal-count: 16 >>> cluster.self-heal-window-size: 1024 >>> performance.quick-read: off >>> performance.io-cache: off >>> performance.stat-prefetch: off >>> cluster.eager-lock: enable >>> network.remote-dio: on >>> cluster.quorum-type: auto >>> cluster.server-quorum-type: server >>> storage.owner-gid: 36 >>> storage.owner-uid: 36 >>> server.allow-insecure: on >>> features.shard: on >>> features.shard-block-size: 64MB >>> performance.strict-o-direct: off >>> cluster.locking-scheme: granular >>> >>> kill -15 brickpid >>> rm -Rf /gluster2/brick3 >>> mkdir -p /gluster2/brick3/1 >>> mkdir /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard >>> /fake2 >>> setfattr -n "user.some-name" -v "some-value" >>> /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/fake2 >>> gluster v start glustershard force >>> >>> At this point the brick process starts and all visible files, including the new >>> dir, are made on the brick. >>> A handful of shards are still in heal statistics, but no .shard directory >>> is created and no increase in shard count. >>> >>> gluster v heal glustershard >>> >>> At this point still no increase in count and no dir made, no additional >>> healing activity generated in the logs. Waited a few minutes tailing logs to >>> check if anything kicked in. >>> >>> gluster v heal glustershard full >>> >>> Shards get added to the heal list and heal commences. Logs show full sweep >>> starting on all 3 nodes,
though this time it only shows as finishing on >>> one which looks to be the one that had brick deleted. >>> >>> [2016-08-30 14:45:33.098589] I [MSGID: 108026] >>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>> starting full sweep on subvol glustershard-client-0 >>> [2016-08-30 14:45:33.099492] I [MSGID: 108026] >>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>> starting full sweep on subvol glustershard-client-1 >>> [2016-08-30 14:45:33.100093] I [MSGID: 108026] >>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>> starting full sweep on subvol glustershard-client-2 >>> [2016-08-30 14:52:29.760213] I [MSGID: 108026] >>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: >>> finished full sweep on subvol glustershard-client-2 >>> >> >> Just realized its still healing so that may be why sweep on 2 other >> bricks haven't replied as finished. >> >>> >>> >>> my hope is that later tonight a full heal will work on production. Is >>> it possible self-heal daemon can get stale or stop listening but still show >>> as active? Would stopping and starting self-heal daemon from gluster cli >>> before doing these heals be helpful? >>> >>> >>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage < >>> [email protected]> wrote: >>> >>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage < >>>> [email protected]> wrote: >>>> >>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay < >>>>> [email protected]> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Could you also share the glustershd logs? >>>>>>>>> >>>>>>>> >>>>>>>> I'll get them when I get to work sure >>>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> I tried the same steps that you mentioned multiple times, but heal >>>>>>>>> is running to completion without any issues. >>>>>>>>> >>>>>>>>> It must be said that 'heal full' traverses the files and >>>>>>>>> directories in a depth-first order and does heals also in the same >>>>>>>>> order. >>>>>>>>> But if it gets interrupted in the middle (say because >>>>>>>>> self-heal-daemon was >>>>>>>>> either intentionally or unintentionally brought offline and then >>>>>>>>> brought >>>>>>>>> back up), self-heal will only pick up the entries that are so far >>>>>>>>> marked as >>>>>>>>> new-entries that need heal which it will find in indices/xattrop >>>>>>>>> directory. >>>>>>>>> What this means is that those files and directories that were not >>>>>>>>> visited >>>>>>>>> during the crawl, will remain untouched and unhealed in this second >>>>>>>>> iteration of heal, unless you execute a 'heal-full' again. >>>>>>>>> >>>>>>>> >>>>>>>> So should it start healing shards as it crawls or not until after >>>>>>>> it crawls the entire .shard directory? At the pace it was going that >>>>>>>> could >>>>>>>> be a week with one node appearing in the cluster but with no shard >>>>>>>> files if >>>>>>>> anything tries to access a file on that node. From my experience >>>>>>>> other day >>>>>>>> telling it to heal full again did nothing regardless of node used. >>>>>>>> >>>>>>> >>>>>> Crawl is started from '/' of the volume. 
Whenever self-heal detects >>>>>> during the crawl that a file or directory is present in some brick(s) and >>>>>> absent in others, it creates the file on the bricks where it is absent >>>>>> and >>>>>> marks the fact that the file or directory might need data/entry and >>>>>> metadata heal too (this also means that an index is created under >>>>>> .glusterfs/indices/xattrop of the src bricks). And the data/entry and >>>>>> metadata heal are picked up and done in >>>>>> >>>>> the background with the help of these indices. >>>>>> >>>>> >>>>> Looking at my 3rd node as example i find nearly an exact same number >>>>> of files in xattrop dir as reported by heal count at time I brought down >>>>> node2 to try and alleviate read io errors that seemed to occur from what I >>>>> was guessing as attempts to use the node with no shards for reads. >>>>> >>>>> Also attached are the glustershd logs from the 3 nodes, along with the >>>>> test node i tried yesterday with same results. >>>>> >>>> >>>> Looking at my own logs I notice that a full sweep was only ever >>>> recorded in glustershd.log on 2nd node with missing directory. I believe I >>>> should have found a sweep begun on every node correct? >>>> >>>> On my test dev when it did work I do see that >>>> >>>> [2016-08-30 13:56:25.223333] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> starting full sweep on subvol glustershard-client-0 >>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> starting full sweep on subvol glustershard-client-1 >>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> starting full sweep on subvol glustershard-client-2 >>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> finished full sweep on subvol glustershard-client-2 >>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> finished full sweep on subvol glustershard-client-1 >>>> [2016-08-30 14:18:49.637811] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0: >>>> finished full sweep on subvol glustershard-client-0 >>>> >>>> While when looking at past few days of the 3 prod nodes i only found >>>> that on my 2nd node >>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> starting full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> finished full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> starting full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> finished full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> starting full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> finished full sweep on subvol GLUSTER1-client-1 >>>> 
[2016-08-27 21:00:42.603315] I [MSGID: 108026] >>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> starting full sweep on subvol GLUSTER1-client-1 >>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026] >>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: >>>> finished full sweep on subvol GLUSTER1-client-1 >>>> >>>> >>>> >>>> >>>> >>>>> >>>>>> >>>>>>>> >>>>>>>>> My suspicion is that this is what happened on your setup. Could >>>>>>>>> you confirm if that was the case? >>>>>>>>> >>>>>>>> >>>>>>>> Brick was brought online with force start then a full heal >>>>>>>> launched. Hours later after it became evident that it was not adding >>>>>>>> new >>>>>>>> files to heal I did try restarting self-heal daemon and relaunching >>>>>>>> full >>>>>>>> heal again. But this was after the heal had basically already failed to >>>>>>>> work as intended. >>>>>>>> >>>>>>> >>>>>>> OK. How did you figure it was not adding any new files? I need to >>>>>>> know what places you were monitoring to come to this conclusion. >>>>>>> >>>>>>> -Krutika >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> As for those logs, I did manager to do something that caused these >>>>>>>>> warning messages you shared earlier to appear in my client and server >>>>>>>>> logs. >>>>>>>>> Although these logs are annoying and a bit scary too, they didn't >>>>>>>>> do any harm to the data in my volume. Why they appear just after a >>>>>>>>> brick is >>>>>>>>> replaced and under no other circumstances is something I'm still >>>>>>>>> investigating. >>>>>>>>> >>>>>>>>> But for future, it would be good to follow the steps Anuradha gave >>>>>>>>> as that would allow self-heal to at least detect that it has some >>>>>>>>> repairing >>>>>>>>> to do whenever it is restarted whether intentionally or otherwise. >>>>>>>>> >>>>>>>> >>>>>>>> I followed those steps as described on my test box and ended up >>>>>>>> with exact same outcome of adding shards at an agonizing slow pace and >>>>>>>> no >>>>>>>> creation of .shard directory or heals on shard directory. Directories >>>>>>>> visible from mount healed quickly. This was with one VM so it has >>>>>>>> only 800 >>>>>>>> shards as well. After hours at work it had added a total of 33 shards >>>>>>>> to >>>>>>>> be healed. I sent those logs yesterday as well though not the >>>>>>>> glustershd. >>>>>>>> >>>>>>>> Does replace-brick command copy files in same manner? For these >>>>>>>> purposes I am contemplating just skipping the heal route. >>>>>>>> >>>>>>>> >>>>>>>>> -Krutika >>>>>>>>> >>>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> attached brick and client logs from test machine where same >>>>>>>>>> behavior occurred not sure if anything new is there. 
its still on >>>>>>>>>> 3.8.2 >>>>>>>>>> >>>>>>>>>> Number of Bricks: 1 x 3 = 3 >>>>>>>>>> Transport-type: tcp >>>>>>>>>> Bricks: >>>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1 >>>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1 >>>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1 >>>>>>>>>> Options Reconfigured: >>>>>>>>>> cluster.locking-scheme: granular >>>>>>>>>> performance.strict-o-direct: off >>>>>>>>>> features.shard-block-size: 64MB >>>>>>>>>> features.shard: on >>>>>>>>>> server.allow-insecure: on >>>>>>>>>> storage.owner-uid: 36 >>>>>>>>>> storage.owner-gid: 36 >>>>>>>>>> cluster.server-quorum-type: server >>>>>>>>>> cluster.quorum-type: auto >>>>>>>>>> network.remote-dio: on >>>>>>>>>> cluster.eager-lock: enable >>>>>>>>>> performance.stat-prefetch: off >>>>>>>>>> performance.io-cache: off >>>>>>>>>> performance.quick-read: off >>>>>>>>>> cluster.self-heal-window-size: 1024 >>>>>>>>>> cluster.background-self-heal-count: 16 >>>>>>>>>> nfs.enable-ino32: off >>>>>>>>>> nfs.addr-namelookup: off >>>>>>>>>> nfs.disable: on >>>>>>>>>> performance.read-ahead: off >>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>> cluster.granular-entry-heal: on >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ----- Original Message ----- >>>>>>>>>>>> > From: "David Gossage" <[email protected]> >>>>>>>>>>>> > To: "Anuradha Talur" <[email protected]> >>>>>>>>>>>> > Cc: "[email protected] List" < >>>>>>>>>>>> [email protected]>, "Krutika Dhananjay" < >>>>>>>>>>>> [email protected]> >>>>>>>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM >>>>>>>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow >>>>>>>>>>>> > >>>>>>>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> > >>>>>>>>>>>> > > Response inline. >>>>>>>>>>>> > > >>>>>>>>>>>> > > ----- Original Message ----- >>>>>>>>>>>> > > > From: "Krutika Dhananjay" <[email protected]> >>>>>>>>>>>> > > > To: "David Gossage" <[email protected]> >>>>>>>>>>>> > > > Cc: "[email protected] List" < >>>>>>>>>>>> [email protected]> >>>>>>>>>>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM >>>>>>>>>>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier >>>>>>>>>>>> Slow >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > Could you attach both client and brick logs? Meanwhile I >>>>>>>>>>>> will try these >>>>>>>>>>>> > > steps >>>>>>>>>>>> > > > out on my machines and see if it is easily recreatable. 
>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > -Krutika >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage < >>>>>>>>>>>> > > [email protected] >>>>>>>>>>>> > > > > wrote: >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > Centos 7 Gluster 3.8.3 >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1 >>>>>>>>>>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1 >>>>>>>>>>>> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1 >>>>>>>>>>>> > > > Options Reconfigured: >>>>>>>>>>>> > > > cluster.data-self-heal-algorithm: full >>>>>>>>>>>> > > > cluster.self-heal-daemon: on >>>>>>>>>>>> > > > cluster.locking-scheme: granular >>>>>>>>>>>> > > > features.shard-block-size: 64MB >>>>>>>>>>>> > > > features.shard: on >>>>>>>>>>>> > > > performance.readdir-ahead: on >>>>>>>>>>>> > > > storage.owner-uid: 36 >>>>>>>>>>>> > > > storage.owner-gid: 36 >>>>>>>>>>>> > > > performance.quick-read: off >>>>>>>>>>>> > > > performance.read-ahead: off >>>>>>>>>>>> > > > performance.io-cache: off >>>>>>>>>>>> > > > performance.stat-prefetch: on >>>>>>>>>>>> > > > cluster.eager-lock: enable >>>>>>>>>>>> > > > network.remote-dio: enable >>>>>>>>>>>> > > > cluster.quorum-type: auto >>>>>>>>>>>> > > > cluster.server-quorum-type: server >>>>>>>>>>>> > > > server.allow-insecure: on >>>>>>>>>>>> > > > cluster.self-heal-window-size: 1024 >>>>>>>>>>>> > > > cluster.background-self-heal-count: 16 >>>>>>>>>>>> > > > performance.strict-write-ordering: off >>>>>>>>>>>> > > > nfs.disable: on >>>>>>>>>>>> > > > nfs.addr-namelookup: off >>>>>>>>>>>> > > > nfs.enable-ino32: off >>>>>>>>>>>> > > > cluster.granular-entry-heal: on >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues. >>>>>>>>>>>> > > > Following steps detailed in previous recommendations >>>>>>>>>>>> began proces of >>>>>>>>>>>> > > > replacing and healngbricks one node at a time. >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > 1) kill pid of brick >>>>>>>>>>>> > > > 2) reconfigure brick from raid6 to raid10 >>>>>>>>>>>> > > > 3) recreate directory of brick >>>>>>>>>>>> > > > 4) gluster volume start <> force >>>>>>>>>>>> > > > 5) gluster volume heal <> full >>>>>>>>>>>> > > Hi, >>>>>>>>>>>> > > >>>>>>>>>>>> > > I'd suggest that full heal is not used. There are a few >>>>>>>>>>>> bugs in full heal. >>>>>>>>>>>> > > Better safe than sorry ;) >>>>>>>>>>>> > > Instead I'd suggest the following steps: >>>>>>>>>>>> > > >>>>>>>>>>>> > > Currently I brought the node down by systemctl stop >>>>>>>>>>>> glusterd as I was >>>>>>>>>>>> > getting sporadic io issues and a few VM's paused so hoping >>>>>>>>>>>> that will help. >>>>>>>>>>>> > I may wait to do this till around 4PM when most work is done >>>>>>>>>>>> in case it >>>>>>>>>>>> > shoots load up. >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > > 1) kill pid of brick >>>>>>>>>>>> > > 2) to configuring of brick that you need >>>>>>>>>>>> > > 3) recreate brick dir >>>>>>>>>>>> > > 4) while the brick is still down, from the mount point: >>>>>>>>>>>> > > a) create a dummy non existent dir under / of mount. >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> > so if noee 2 is down brick, pick node for example 3 and make >>>>>>>>>>>> a test dir >>>>>>>>>>>> > under its brick directory that doesnt exist on 2 or should I >>>>>>>>>>>> be dong this >>>>>>>>>>>> > over a gluster mount? >>>>>>>>>>>> You should be doing this over gluster mount. >>>>>>>>>>>> > >>>>>>>>>>>> > > b) set a non existent extended attribute on / of mount. 
>>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> > Could you give me an example of an attribute to set? I've >>>>>>>>>>>> read a tad on >>>>>>>>>>>> > this, and looked up attributes but haven't set any yet myself. >>>>>>>>>>>> > >>>>>>>>>>>> Sure. setfattr -n "user.some-name" -v "some-value" >>>>>>>>>>>> <path-to-mount> >>>>>>>>>>>> > Doing these steps will ensure that heal happens only from >>>>>>>>>>>> updated brick to >>>>>>>>>>>> > > down brick. >>>>>>>>>>>> > > 5) gluster v start <> force >>>>>>>>>>>> > > 6) gluster v heal <> >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> > Will it matter if somewhere in gluster the full heal command >>>>>>>>>>>> was run other >>>>>>>>>>>> > day? Not sure if it eventually stops or times out. >>>>>>>>>>>> > >>>>>>>>>>>> full heal will stop once the crawl is done. So if you want to >>>>>>>>>>>> trigger heal again, >>>>>>>>>>>> run gluster v heal <>. Actually even brick up or volume start >>>>>>>>>>>> force should >>>>>>>>>>>> trigger the heal. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Did this on test bed today. its one server with 3 bricks on >>>>>>>>>>> same machine so take that for what its worth. also it still runs >>>>>>>>>>> 3.8.2. >>>>>>>>>>> Maybe ill update and re-run test. >>>>>>>>>>> >>>>>>>>>>> killed brick >>>>>>>>>>> deleted brick dir >>>>>>>>>>> recreated brick dir >>>>>>>>>>> created fake dir on gluster mount >>>>>>>>>>> set suggested fake attribute on it >>>>>>>>>>> ran volume start <> force >>>>>>>>>>> >>>>>>>>>>> looked at files it said needed healing and it was just 8 shards >>>>>>>>>>> that were modified for few minutes I ran through steps >>>>>>>>>>> >>>>>>>>>>> gave it few minutes and it stayed same >>>>>>>>>>> ran gluster volume <> heal >>>>>>>>>>> >>>>>>>>>>> it healed all the directories and files you can see over mount >>>>>>>>>>> including fakedir. >>>>>>>>>>> >>>>>>>>>>> same issue for shards though. it adds more shards to heal at >>>>>>>>>>> glacier pace. slight jump in speed if I stat every file and dir in >>>>>>>>>>> VM >>>>>>>>>>> running but not all shards. >>>>>>>>>>> >>>>>>>>>>> It started with 8 shards to heal and is now only at 33 out of >>>>>>>>>>> 800 and probably wont finish adding for few days at rate it goes. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> > > > 1st node worked as expected took 12 hours to heal 1TB >>>>>>>>>>>> data. Load was >>>>>>>>>>>> > > little >>>>>>>>>>>> > > > heavy but nothing shocking. >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > About an hour after node 1 finished I began same process >>>>>>>>>>>> on node2. Heal >>>>>>>>>>>> > > > proces kicked in as before and the files in directories >>>>>>>>>>>> visible from >>>>>>>>>>>> > > mount >>>>>>>>>>>> > > > and .glusterfs healed in short time. Then it began crawl >>>>>>>>>>>> of .shard adding >>>>>>>>>>>> > > > those files to heal count at which point the entire >>>>>>>>>>>> proces ground to a >>>>>>>>>>>> > > halt >>>>>>>>>>>> > > > basically. After 48 hours out of 19k shards it has added >>>>>>>>>>>> 5900 to heal >>>>>>>>>>>> > > list. >>>>>>>>>>>> > > > Load on all 3 machnes is negligible. It was suggested to >>>>>>>>>>>> change this >>>>>>>>>>>> > > value >>>>>>>>>>>> > > > to full cluster.data-self-heal-algorithm and restart >>>>>>>>>>>> volume which I >>>>>>>>>>>> > > did. No >>>>>>>>>>>> > > > efffect. Tried relaunching heal no effect, despite any >>>>>>>>>>>> node picked. 
I >>>>>>>>>>>> > > > started each VM and performed a stat of all files from >>>>>>>>>>>> within it, or a >>>>>>>>>>>> > > full >>>>>>>>>>>> > > > virus scan and that seemed to cause short small spikes in >>>>>>>>>>>> shards added, >>>>>>>>>>>> > > but >>>>>>>>>>>> > > > not by much. Logs are showing no real messages indicating >>>>>>>>>>>> anything is >>>>>>>>>>>> > > going >>>>>>>>>>>> > > > on. I get hits to brick log on occasion of null lookups >>>>>>>>>>>> making me think >>>>>>>>>>>> > > its >>>>>>>>>>>> > > > not really crawling shards directory but waiting for a >>>>>>>>>>>> shard lookup to >>>>>>>>>>>> > > add >>>>>>>>>>>> > > > it. I'll get following in brick log but not constant and >>>>>>>>>>>> sometime >>>>>>>>>>>> > > multiple >>>>>>>>>>>> > > > for same shard. >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > [2016-08-29 08:31:57.478125] W [MSGID: 115009] >>>>>>>>>>>> > > > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: >>>>>>>>>>>> no resolution >>>>>>>>>>>> > > type >>>>>>>>>>>> > > > for (null) (LOOKUP) >>>>>>>>>>>> > > > [2016-08-29 08:31:57.478170] E [MSGID: 115050] >>>>>>>>>>>> > > > [server-rpc-fops.c:156:server_lookup_cbk] >>>>>>>>>>>> 0-GLUSTER1-server: 12591783: >>>>>>>>>>>> > > > LOOKUP (null) (00000000-0000-0000-00 >>>>>>>>>>>> > > > 00-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) >>>>>>>>>>>> ==> (Invalid >>>>>>>>>>>> > > > argument) [Invalid argument] >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > This one repeated about 30 times in row then nothing for >>>>>>>>>>>> 10 minutes then >>>>>>>>>>>> > > one >>>>>>>>>>>> > > > hit for one different shard by itself. >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > How can I determine if Heal is actually running? How can >>>>>>>>>>>> I kill it or >>>>>>>>>>>> > > force >>>>>>>>>>>> > > > restart? Does node I start it from determine which >>>>>>>>>>>> directory gets >>>>>>>>>>>> > > crawled to >>>>>>>>>>>> > > > determine heals? >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > David Gossage >>>>>>>>>>>> > > > Carousel Checks Inc. | System Administrator >>>>>>>>>>>> > > > Office 708.613.2284 >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > _______________________________________________ >>>>>>>>>>>> > > > Gluster-users mailing list >>>>>>>>>>>> > > > [email protected] >>>>>>>>>>>> > > > http://www.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > _______________________________________________ >>>>>>>>>>>> > > > Gluster-users mailing list >>>>>>>>>>>> > > > [email protected] >>>>>>>>>>>> > > > http://www.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> > > >>>>>>>>>>>> > > -- >>>>>>>>>>>> > > Thanks, >>>>>>>>>>>> > > Anuradha. >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Anuradha. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
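One other thing that might be worth cross-checking on your end: the number of pending-heal indices sitting on a brick versus what heal-count reports. Something along these lines should work (brick path and volume name are taken from your earlier mail; the grep is only meant to skip the base 'xattrop-' link file, so treat it as a rough count):

ls /gluster1/BRICK1/1/.glusterfs/indices/xattrop | grep -v '^xattrop-' | wc -l   # pending entry indices on this brick
gluster volume heal GLUSTER1 statistics heal-count                               # what the shd reports

If the two numbers track each other, that would suggest the indices are being created but the background heal isn't draining them.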
