I noticed that my new brick (replacement disk) did not have a .shard directory 
created on the brick, if that helps. 

I removed the affected brick from the volume, wiped the disk, did an 
add-brick, and everything healed right up. I didn't try to set any attrs or 
anything else; I just removed the brick and added it back as new.
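
In case it's useful to anyone else, the sequence was roughly the following (the volume 
name, brick path, device, and replica counts are placeholders for my setup, so treat 
this as a sketch rather than the exact commands):

  # drop the dead brick, reducing the replica count by one
  gluster volume remove-brick myvol replica 2 host1:/bricks/brick1 force

  # wipe and re-create the filesystem on the replacement disk, then the brick dir
  mkfs.xfs -f /dev/sdX
  mount /dev/sdX /bricks
  mkdir -p /bricks/brick1

  # add it back as a brand-new brick, restoring the replica count, and kick off a heal
  gluster volume add-brick myvol replica 3 host1:/bricks/brick1
  gluster volume heal myvol full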

> On Aug 29, 2016, at 9:49 AM, Darrell Budic <bu...@onholyground.com> wrote:
> 
> Just to let you know, I'm seeing the same issue under 3.7.14 on CentOS 7. Some 
> content was healed correctly; now all the shards are queued up in a heal 
> list, but nothing is healing. I'm getting brick errors similar to the ones 
> David was seeing, logged on the brick that isn't healing:
> 
> [2016-08-29 03:31:40.436110] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP 
> (null) 
> (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
>  ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP 
> (null) 
> (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
>  ==> (Invalid argument) [Invalid argument]
> 
> This was after replacing the drive the brick was on and trying to get it back 
> into the system by setting the volume's fattr on the brick dir. I'll try the 
> method suggested here on it shortly.
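> 
> (For reference, the xattr approach was roughly the following; the brick paths are 
> placeholders and the volume name is a guess from the log prefix above, so take this 
> as a sketch of the idea rather than the exact commands I ran:)
> 
>   # read the volume-id xattr from a surviving, healthy brick
>   getfattr -n trusted.glusterfs.volume-id -e hex /path/to/healthy/brick
> 
>   # stamp the same volume-id onto the new, empty brick directory
>   setfattr -n trusted.glusterfs.volume-id -v 0x<value-from-above> /path/to/new/brick
> 
>   # restart the volume so the brick process starts against the new directory
>   gluster volume start gv0-rep force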
> 
>   -Darrell
> 
> 
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <kdhan...@redhat.com> wrote:
>> 
>> Got it. Thanks.
>> 
>> I tried the same test and shd crashed with SIGABRT (well, that's because I 
>> compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <dgoss...@carouselchecks.com> wrote:
>> 
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <dgoss...@carouselchecks.com> wrote:
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhan...@redhat.com> wrote:
>> Could you attach both client and brick logs? Meanwhile I will try these 
>> steps out on my machines and see if it is easily recreatable.
>> 
>> 
>> Hoping 7z files are accepted by the mail server.
>> 
>> Looks like the zip file is awaiting approval due to its size.
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <dgoss...@carouselchecks.com> wrote:
>> CentOS 7, Gluster 3.8.3
>> 
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
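>> 
>> (Side note: the options above can be re-checked at any time with something like the 
>> following; GLUSTER1 is the volume name as it appears in the brick logs below:)
>> 
>>   gluster volume info GLUSTER1
>>   gluster volume get GLUSTER1 all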
>> 
>> Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
>> Following the steps detailed in previous recommendations, I began the process of 
>> replacing and healing bricks one node at a time (a rough sketch of the commands 
>> follows the numbered steps):
>> 
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
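>> 
>> Concretely, the commands for steps 1, 4 and 5 looked roughly like the following (the 
>> volume name GLUSTER1 is taken from the brick log prefix below, the brick path from 
>> the volume info above, and using volume status to find the brick pid is just one way 
>> to locate it):
>> 
>>   # step 1: find the brick's pid and kill that process
>>   gluster volume status GLUSTER1
>>   kill <pid-of-brick-from-status-output>
>> 
>>   # steps 2/3: rebuild the array as raid10, then recreate the empty brick directory
>>   mkdir -p /gluster1/BRICK1/1
>> 
>>   # step 4: force-start the volume so the new empty brick is brought online
>>   gluster volume start GLUSTER1 force
>> 
>>   # step 5: trigger a full heal to repopulate the new brick
>>   gluster volume heal GLUSTER1 full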
>> 
>> The 1st node worked as expected and took 12 hours to heal 1TB of data. Load was a 
>> little heavy but nothing shocking.
>> 
>> About an hour after node 1 finished, I began the same process on node 2. The heal 
>> process kicked in as before, and the files in the directories visible from the mount 
>> and in .glusterfs healed in short order. Then it began the crawl of .shard, adding 
>> those files to the heal count, at which point the entire process basically ground to 
>> a halt. After 48 hours it has added 5900 of the 19k shards to the heal list. 
>> Load on all 3 machines is negligible. It was suggested to change 
>> cluster.data-self-heal-algorithm to full and restart the volume, which I did. No 
>> effect. Tried relaunching the heal with no effect, regardless of which node I picked 
>> to start it from. I started each VM and performed a stat of all files from within 
>> it, or a full virus scan, and that seemed to cause short, small spikes in shards 
>> added, but not by much. The logs show no real messages indicating anything is going 
>> on. I get occasional hits in the brick log for null lookups, which makes me think it 
>> is not really crawling the .shard directory but waiting for a shard lookup to add 
>> it. I'll get the following in the brick log, but not constantly, and sometimes 
>> multiple entries for the same shard:
>> 
>> [2016-08-29 08:31:57.478125] W [MSGID: 115009] 
>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type 
>> for (null) (LOOKUP)
>> [2016-08-29 08:31:57.478170] E [MSGID: 115050] 
>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: 
>> LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) 
>> ==> (Invalid argument) [Invalid argument]
>> 
>> This one repeated about 30 times in a row, then nothing for 10 minutes, then a 
>> single hit for a different shard by itself.
>> 
>> How can I determine whether the heal is actually running? How can I kill it or force 
>> a restart? Does the node I start it from determine which directory gets crawled to 
>> determine what needs healing?
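>> 
>> (For what it's worth, these are the heal status commands I know of; I'm not sure how 
>> well they reflect actual crawl activity, and the volume name is the one from the 
>> brick log above:)
>> 
>>   # summary of entries pending heal, per brick
>>   gluster volume heal GLUSTER1 info
>>   gluster volume heal GLUSTER1 statistics heal-count
>> 
>>   # details of past and ongoing crawls run by the self-heal daemon
>>   gluster volume heal GLUSTER1 statistics
>> 
>>   # check that the self-heal daemon itself is up on each node
>>   gluster volume status GLUSTER1 shd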
>> 
>> David Gossage
>> Carousel Checks Inc. | System Administrator
>> Office 708.613.2284
> 

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
