Got it. Thanks. I tried the same test and the self-heal daemon (shd) crashed with SIGABRT (well, that's because I compiled from source with -DDEBUG). In any case, this error would prevent the full heal from proceeding further. I'm debugging the crash now. Will let you know when I have the root cause.
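For anyone who wants to pull a backtrace from a similar shd crash, something along these lines should work (a rough sketch; the core file path is illustrative, and it assumes core dumps are enabled and glusterfs debug symbols are installed):

    # shd runs as a glusterfs process, so point gdb at that binary plus the core
    gdb /usr/sbin/glusterfs /path/to/core
    (gdb) bt                    # backtrace of the thread that hit the abort
    (gdb) thread apply all bt   # all threads, since shd runs many heal threads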
-Krutika

On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <[email protected]> wrote:
>
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <[email protected]> wrote:
>
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <[email protected]> wrote:
>>
>>> Could you attach both client and brick logs? Meanwhile I will try these
>>> steps out on my machines and see if it is easily recreatable.
>>
>> Hoping 7z files are accepted by the mail server.
>
> Looks like the zip file is awaiting approval due to size.
>
>>> -Krutika
>>>
>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <[email protected]> wrote:
>>>
>>>> CentOS 7, Gluster 3.8.3
>>>>
>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>> Options Reconfigured:
>>>> cluster.data-self-heal-algorithm: full
>>>> cluster.self-heal-daemon: on
>>>> cluster.locking-scheme: granular
>>>> features.shard-block-size: 64MB
>>>> features.shard: on
>>>> performance.readdir-ahead: on
>>>> storage.owner-uid: 36
>>>> storage.owner-gid: 36
>>>> performance.quick-read: off
>>>> performance.read-ahead: off
>>>> performance.io-cache: off
>>>> performance.stat-prefetch: on
>>>> cluster.eager-lock: enable
>>>> network.remote-dio: enable
>>>> cluster.quorum-type: auto
>>>> cluster.server-quorum-type: server
>>>> server.allow-insecure: on
>>>> cluster.self-heal-window-size: 1024
>>>> cluster.background-self-heal-count: 16
>>>> performance.strict-write-ordering: off
>>>> nfs.disable: on
>>>> nfs.addr-namelookup: off
>>>> nfs.enable-ino32: off
>>>> cluster.granular-entry-heal: on
>>>>
>>>> Friday did a rolling upgrade from 3.8.3->3.8.3 with no issues.
>>>> Following the steps detailed in previous recommendations, I began the
>>>> process of replacing and healing bricks one node at a time:
>>>>
>>>> 1) kill pid of brick
>>>> 2) reconfigure brick from raid6 to raid10
>>>> 3) recreate directory of brick
>>>> 4) gluster volume start <> force
>>>> 5) gluster volume heal <> full
>>>>
>>>> The 1st node worked as expected; it took 12 hours to heal 1TB of data.
>>>> Load was a little heavy but nothing shocking.
>>>>
>>>> About an hour after node 1 finished, I began the same process on node 2.
>>>> The heal process kicked in as before, and the files in directories
>>>> visible from the mount and in .glusterfs healed in short order. Then it
>>>> began the crawl of .shard, adding those files to the heal count, at
>>>> which point the entire process basically ground to a halt. After 48
>>>> hours, out of 19k shards it has added 5900 to the heal list. Load on all
>>>> 3 machines is negligible. It was suggested to change
>>>> cluster.data-self-heal-algorithm to full and restart the volume, which I
>>>> did. No effect. Tried relaunching the heal; no effect, regardless of
>>>> which node I ran it from. I started each VM and performed a stat of all
>>>> files from within it, or a full virus scan, and that seemed to cause
>>>> short, small spikes in shards added, but not by much. The logs show no
>>>> real messages indicating anything is going on. I get occasional hits in
>>>> the brick log for null lookups, making me think it's not really crawling
>>>> the shards directory but waiting for a shard lookup to add each one.
>>>> I'll get the following in the brick log, but not constantly, and
>>>> sometimes multiple entries for the same shard:
>>>>
>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
>>>> type for (null) (LOOKUP)
>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
>>>> LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
>>>> ==> (Invalid argument) [Invalid argument]
>>>>
>>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>>> then one hit for a different shard by itself.
>>>>
>>>> How can I determine if the heal is actually running? How can I kill it
>>>> or force a restart? Does the node I start it from determine which
>>>> directory gets crawled to determine heals?
>>>>
>>>> *David Gossage*
>>>> *Carousel Checks Inc. | System Administrator*
>>>> *Office* 708.613.2284
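For what it's worth, the standard way to check whether a heal crawl is actually making progress is the heal CLI, run from any node (a sketch; GLUSTER1 is the volume name taken from the brick log above):

    # confirm the self-heal daemon is online on each node
    gluster volume status GLUSTER1
    # cheap per-brick pending-heal counts, safe to poll in a loop
    gluster volume heal GLUSTER1 statistics heal-count
    # per-crawl statistics, including whether a crawl is currently in progress
    gluster volume heal GLUSTER1 statistics
    # re-trigger the full crawl; without "full" only the index heal is launched
    gluster volume heal GLUSTER1 full

As for killing or restarting it: toggling cluster.self-heal-daemon off and back on with "gluster volume set" should restart the shd processes, which is the closest thing to forcing a heal restart.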
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
