Ignore that. I just realised you're on 3.7.14, so the problem may not be with the granular entry self-heal feature.

-Krutika
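For reference, a quick way to confirm the installed release and whether granular entry heal is enabled would be something like the following sketch; the volume name GLUSTER1 is taken from the brick logs further down in this thread and should be replaced with the actual volume name.

# Report the installed GlusterFS version on this node.
gluster --version

# List the volume's reconfigured options; cluster.granular-entry-heal
# only shows up here if it has been set explicitly.
gluster volume info GLUSTER1 | grep -i granular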
On Tue, Aug 30, 2016 at 10:14 AM, Krutika Dhananjay <[email protected]> wrote:

> OK. Do you also have granular-entry-heal on? Just so that I can isolate
> the problem area.
>
> -Krutika
>
> On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic <[email protected]> wrote:
>
>> I noticed that my new brick (replacement disk) did not have a .shard
>> directory created on the brick, if that helps.
>>
>> I removed the affected brick from the volume and then wiped the disk,
>> did an add-brick, and everything healed right up. I didn't try to set
>> any attrs or anything else, just removed and added the brick as new.
>>
>> On Aug 29, 2016, at 9:49 AM, Darrell Budic <[email protected]> wrote:
>>
>> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7.
>> Some content was healed correctly; now all the shards are queued up in
>> the heal list, but nothing is healing. I got brick errors logged that
>> are similar to the ones David was getting, on the brick that isn't
>> healing:
>>
>> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
>> LOOKUP (null) (00000000-0000-0000-0000-000000000000
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) ==> (Invalid argument)
>> [Invalid argument]
>> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
>> LOOKUP (null) (00000000-0000-0000-0000-000000000000
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) ==> (Invalid argument)
>> [Invalid argument]
>>
>> This was after replacing the drive the brick was on and trying to get it
>> back into the system by setting the volume's fattr on the brick dir.
>> I'll try the method suggested here on it shortly.
>>
>> -Darrell
>>
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <[email protected]> wrote:
>>
>> Got it. Thanks.
>>
>> I tried the same test and shd crashed with SIGABRT (well, that's because
>> I compiled from src with -DDEBUG). In any case, this error would prevent
>> full heal from proceeding further. I'm debugging the crash now. Will let
>> you know when I have the RC.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <[email protected]> wrote:
>>>
>>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <[email protected]> wrote:
>>>
>>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <[email protected]> wrote:
>>>>
>>>>> Could you attach both client and brick logs? Meanwhile I will try
>>>>> these steps out on my machines and see if it is easily reproducible.
>>>>
>>>> Hoping 7z files are accepted by the mail server.
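Darrell's remove-and-re-add workaround above would look roughly like the commands below. This is only a sketch: the volume name gv0-rep comes from his brick logs, while the hostname and brick path are placeholders, and the replica counts assume a plain replica-3 volume.

# Drop the brick backed by the failed disk from the replica set
# (replica count goes from 3 to 2 on a 1x3 volume).
gluster volume remove-brick gv0-rep replica 2 host3:/gluster1/BRICK1/1 force

# Wipe and re-create the brick directory on the new disk, then add it back
# as a brand-new brick (no old xattrs carried over).
mkdir -p /gluster1/BRICK1/1
gluster volume add-brick gv0-rep replica 3 host3:/gluster1/BRICK1/1

# Trigger a full heal so the empty brick gets repopulated.
gluster volume heal gv0-rep full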
>>>
>>> Looks like the zip file is awaiting approval due to size.
>>>
>>>>> -Krutika
>>>>>
>>>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <[email protected]> wrote:
>>>>>
>>>>>> CentOS 7, Gluster 3.8.3
>>>>>>
>>>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>> Options Reconfigured:
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.self-heal-daemon: on
>>>>>> cluster.locking-scheme: granular
>>>>>> features.shard-block-size: 64MB
>>>>>> features.shard: on
>>>>>> performance.readdir-ahead: on
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: on
>>>>>> cluster.eager-lock: enable
>>>>>> network.remote-dio: enable
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> server.allow-insecure: on
>>>>>> cluster.self-heal-window-size: 1024
>>>>>> cluster.background-self-heal-count: 16
>>>>>> performance.strict-write-ordering: off
>>>>>> nfs.disable: on
>>>>>> nfs.addr-namelookup: off
>>>>>> nfs.enable-ino32: off
>>>>>> cluster.granular-entry-heal: on
>>>>>>
>>>>>> Friday I did a rolling upgrade to 3.8.3 with no issues. Following the
>>>>>> steps detailed in previous recommendations, I began the process of
>>>>>> replacing and healing bricks one node at a time:
>>>>>>
>>>>>> 1) kill pid of brick
>>>>>> 2) reconfigure brick from raid6 to raid10
>>>>>> 3) recreate directory of brick
>>>>>> 4) gluster volume start <> force
>>>>>> 5) gluster volume heal <> full
>>>>>>
>>>>>> The 1st node worked as expected and took 12 hours to heal 1TB of data.
>>>>>> Load was a little heavy but nothing shocking.
>>>>>>
>>>>>> About an hour after node 1 finished, I began the same process on
>>>>>> node 2. The heal process kicked in as before, and the files in
>>>>>> directories visible from the mount and in .glusterfs healed in short
>>>>>> order. Then it began the crawl of .shard, adding those files to the
>>>>>> heal count, at which point the entire process basically ground to a
>>>>>> halt. After 48 hours, out of 19k shards it has added 5900 to the heal
>>>>>> list. Load on all 3 machines is negligible. It was suggested to change
>>>>>> cluster.data-self-heal-algorithm to full and restart the volume, which
>>>>>> I did. No effect. Tried relaunching the heal, no effect, regardless of
>>>>>> which node it was started from. I started each VM and performed a stat
>>>>>> of all files from within it, or a full virus scan, and that seemed to
>>>>>> cause short, small spikes in shards added, but not by much. The logs
>>>>>> show no real messages indicating anything is going on. I get hits in
>>>>>> the brick log on occasion for null lookups, making me think it's not
>>>>>> really crawling the shards directory but waiting for a shard lookup to
>>>>>> add it. I'll get the following in the brick log, but not constantly,
>>>>>> and sometimes multiple entries for the same shard:
>>>>>>
>>>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no
>>>>>> resolution type for (null) (LOOKUP)
>>>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
>>>>>> 12591783: LOOKUP (null) (00000000-0000-0000-0000-000000000000
>>>>>> /241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid argument)
>>>>>> [Invalid argument]
>>>>>>
>>>>>> This one repeated about 30 times in a row, then nothing for 10
>>>>>> minutes, then a single hit for one different shard by itself.
>>>>>>
>>>>>> How can I determine if the heal is actually running? How can I kill
>>>>>> it or force a restart? Does the node I start it from determine which
>>>>>> directory gets crawled to determine the heals?
>>>>>>
>>>>>> *David Gossage*
>>>>>> *Carousel Checks Inc. | System Administrator*
>>>>>> *Office* 708.613.2284
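On the question of whether a heal is actually running, the commands below are one way to check from any node in the cluster. This is a sketch using the GLUSTER1 volume name from the logs above; output details vary a little between releases.

# Entries still pending heal, per brick (can take a while on large volumes).
gluster volume heal GLUSTER1 info

# Just the pending counts per brick.
gluster volume heal GLUSTER1 statistics heal-count

# Self-heal daemon crawl statistics, including whether a crawl is in progress.
gluster volume heal GLUSTER1 statistics

# Confirm the self-heal daemon is online on every node, then watch its log.
gluster volume status GLUSTER1
tail -f /var/log/glusterfs/glustershd.log

# Re-trigger an index heal or a full heal.
gluster volume heal GLUSTER1
gluster volume heal GLUSTER1 full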
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
