Re: [Gluster-users] Healing Delays - possibly solved

2016-10-19 Thread Lindsay Mathieson

On 2/10/2016 12:48 AM, Lindsay Mathieson wrote:
> This was raised earlier but I don't believe it was ever resolved and
> it is becoming a serious issue for me.
>
> I'm doing rolling upgrades on our three node cluster (Replica 3,
> Sharded, VM Workload).
>
> I update one node, reboot it, wait for healing to complete, do the
> next one.


Recently, I decided to remove all the heal optimisations I had made to 
the volume settings:


- cluster.self-heal-window-size: 1024
- cluster.background-self-heal-count: 16


Just reset them to defaults.
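
For anyone wanting to do the same, the reset was roughly the following 
(a sketch; the volume name datastore4 is assumed from the volume info 
further down the thread, and "gluster volume reset" reverts an option 
to its default):

# revert the two heal tunables to their defaults
gluster volume reset datastore4 cluster.self-heal-window-size
gluster volume reset datastore4 cluster.background-self-heal-count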


Since then I've done several rolling reboots and a lot of test kills of 
glusterfsd. Each time the volume heals have been prompt and fast, so I'm 
going to assume my settings were part of the problem. Perhaps the large 
heal window size was responsible for the long delays before healing started?



Cheers,


--
Lindsay Mathieson



Re: [Gluster-users] Healing Delays

2016-10-03 Thread Pranith Kumar Karampuri
On Sun, Oct 2, 2016 at 5:49 AM, Lindsay Mathieson <
lindsay.mathie...@gmail.com> wrote:

> On 2/10/2016 12:48 AM, Lindsay Mathieson wrote:
>
>> Except that the heal count does not change; healing just does not seem to
>> start. It can take hours before it shifts, but once it does, it's quite
>> rapid. Node 1 has restarted and the heal count has been static at 511
>> shards for 45 minutes now. Nodes 1 & 2 have low CPU load, node 3 has
>> glusterfsd pegged at 800% CPU.
>>
>
> Ok, had a try at systematically reproducing it this morning and was
> actually unable to do so - quite weird. Testing was the same as last night
> - move all the VMs off a server and reboot it, then wait for the healing
> to finish. This time I tried it with various settings.
>
>
> Test 1
> --
> cluster.granular-entry-heal no
> cluster.locking-scheme full
> Shards / Min: 350 / 8
>
>
> Test 2
> --
> cluster.granular-entry-heal yes
> cluster.locking-scheme granular
> Shards / Min:  391 / 10
>
> Test 3
> --
> cluster.granular-entry-heal yes
> cluster.locking-scheme granular
> heal command issued
> Shards / Min: 358 / 11
>
> Test 4
> --
> cluster.granular-entry-heal yes
> cluster.locking-scheme granular
> heal full command issued
> Shards / Min: 358 / 27
>
>
> Best results were with cluster.granular-entry-heal=yes,
> cluster.locking-scheme=granular but they were all quite good.
>
>
> Don't know why it was so much worse last night - i/o load, cpu and memory
> were the same. However, one thing that is different, and which I can't
> easily reproduce, is that the cluster had been running for several weeks
> until last night, when I rebooted all nodes. Could gluster be developing
> an issue after running for some time?


From the algorithm's point of view, the only thing that matters is the
amount of data it needs to heal; it doesn't depend on the age of the
cluster. So whether the 100GB to be healed accumulated in a very short
time or over a few months, the time to heal should be the same.


>
>
>
> --
> Lindsay Mathieson
>



-- 
Pranith

Re: [Gluster-users] Healing Delays

2016-10-01 Thread Lindsay Mathieson

On 2/10/2016 12:48 AM, Lindsay Mathieson wrote:
> Except that the heal count does not change; healing just does not seem to
> start. It can take hours before it shifts, but once it does, it's quite
> rapid. Node 1 has restarted and the heal count has been static at 511
> shards for 45 minutes now. Nodes 1 & 2 have low CPU load, node 3 has
> glusterfsd pegged at 800% CPU.


Ok, had a try at systematically reproducing it this morning and was 
actually unable to do so - quite weird. Testing was the same as last 
night - move all the VMs off a server and reboot it, then wait for the 
healing to finish. This time I tried it with various settings.



Test 1
--
cluster.granular-entry-heal no
cluster.locking-scheme full
Shards / Min: 350 / 8


Test 2
--
cluster.granular-entry-heal yes
cluster.locking-scheme granular
Shards / Min:  391 / 10

Test 3
--
cluster.granular-entry-heal yes
cluster.locking-scheme granular
heal command issued
Shards / Min: 358 / 11

Test 4
--
cluster.granular-entry-heal yes
cluster.locking-scheme granular
heal full command issued
Shards / Min: 358 / 27


Best results were with cluster.granular-entry-heal=yes, 
cluster.locking-scheme=granular but they were all quite good.
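
For reference, the settings and heal triggers above map to commands along 
these lines (a sketch; volume name datastore4 assumed from the volume 
info further down the thread):

# switch entry heal and AFR locking to the granular variants
gluster volume set datastore4 cluster.granular-entry-heal on
gluster volume set datastore4 cluster.locking-scheme granular
# index heal (Test 3) versus full heal (Test 4)
gluster volume heal datastore4
gluster volume heal datastore4 full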



Don't know why it was so much worse last night - i/o load, cpu and 
memory were the same. However, one thing that is different, and which I 
can't easily reproduce, is that the cluster had been running for several 
weeks until last night, when I rebooted all nodes. Could gluster be 
developing an issue after running for some time?



--
Lindsay Mathieson



Re: [Gluster-users] Healing Delays

2016-10-01 Thread Lindsay Mathieson

On 2/10/2016 12:48 AM, Lindsay Mathieson wrote:

> 511 shards for 45 minutes


At (roughly) the one-hour mark it started ticking over; the heal 
completed at 1.5 hours.


--
Lindsay Mathieson



Re: [Gluster-users] Healing Delays

2016-10-01 Thread Krutika Dhananjay
Any errors/warnings in the glustershd logs?
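
For example, something along these lines on each node should surface them 
(assuming the default glustershd log location):

# show recent error/warning lines from the self-heal daemon log
grep -E ' [EW] \[' /var/log/glusterfs/glustershd.log | tail -n 50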

-Krutika

On Sat, Oct 1, 2016 at 8:18 PM, Lindsay Mathieson <
lindsay.mathie...@gmail.com> wrote:

> This was raised earlier but I don't believe it was ever resolved and it is
> becoming a serious issue for me.
>
>
> I'm doing rolling upgrades on our three node cluster (Replica 3, Sharded,
> VM Workload).
>
>
> I update one node, reboot it, wait for healing to complete, do the next
> one.
>
>
> Except that the heal count does not change; healing just does not seem to
> start. It can take hours before it shifts, but once it does, it's quite
> rapid. Node 1 has restarted and the heal count has been static at 511
> shards for 45 minutes now. Nodes 1 & 2 have low CPU load, node 3 has
> glusterfsd pegged at 800% CPU.
>
>
> This was *not* the case in earlier versions of gluster (3.7.11 I think);
> healing would start almost right away. I think it started doing this when
> the afr locking improvements were made.
>
>
> I have experimented with full & diff heal modes; it doesn't make any
> difference.
>
> Current:
>
> Gluster Version 3.8.4
>
> Volume Name: datastore4
> Type: Replicate
> Volume ID: 0ba131ef-311d-4bb1-be46-596e83b2f6ce
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: vnb.proxmox.softlog:/tank/vmdata/datastore4
> Brick2: vng.proxmox.softlog:/tank/vmdata/datastore4
> Brick3: vna.proxmox.softlog:/tank/vmdata/datastore4
> Options Reconfigured:
> cluster.self-heal-window-size: 1024
> cluster.locking-scheme: granular
> cluster.granular-entry-heal: on
> performance.readdir-ahead: on
> cluster.data-self-heal: on
> features.shard: on
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> performance.strict-write-ordering: off
> performance.stat-prefetch: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> features.shard-block-size: 64MB
> cluster.background-self-heal-count: 16
>
>
> Thanks,
>
>
>
>
>
> --
> Lindsay Mathieson
>

[Gluster-users] Healing Delays

2016-10-01 Thread Lindsay Mathieson
This was raised earlier but I don't believe it was ever resolved and it 
is becoming a serious issue for me.



I'm doing rolling upgrades on our three node cluster (Replica 3, 
Sharded, VM Workload).



I update one node, reboot it, wait for healing to complete, do the next one.
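
(The waiting is basically watching the pending-heal count until every 
brick reports zero, roughly as below; volume name datastore4 as per the 
volume info at the end of this mail:)

# per-brick count of entries still pending heal
gluster volume heal datastore4 statistics heal-count
# or list the pending entries themselves
gluster volume heal datastore4 info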


Except that the heal count does not change; healing just does not seem to 
start. It can take hours before it shifts, but once it does, it's quite 
rapid. Node 1 has restarted and the heal count has been static at 511 
shards for 45 minutes now. Nodes 1 & 2 have low CPU load, node 3 has 
glusterfsd pegged at 800% CPU.



This was *not* the case in earlier versions of gluster (3.7.11 I think); 
healing would start almost right away. I think it started doing this 
when the afr locking improvements were made.



I have experimented with full & diff heal modes; it doesn't make any 
difference.
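
(i.e. the cluster.data-self-heal-algorithm option, switched along these 
lines:)

# 'full' copies the whole file from a good copy; 'diff' heals only the
# blocks whose checksums mismatch
gluster volume set datastore4 cluster.data-self-heal-algorithm full
gluster volume set datastore4 cluster.data-self-heal-algorithm diff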


Current:

Gluster Version 3.8.4

Volume Name: datastore4
Type: Replicate
Volume ID: 0ba131ef-311d-4bb1-be46-596e83b2f6ce
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vnb.proxmox.softlog:/tank/vmdata/datastore4
Brick2: vng.proxmox.softlog:/tank/vmdata/datastore4
Brick3: vna.proxmox.softlog:/tank/vmdata/datastore4
Options Reconfigured:
cluster.self-heal-window-size: 1024
cluster.locking-scheme: granular
cluster.granular-entry-heal: on
performance.readdir-ahead: on
cluster.data-self-heal: on
features.shard: on
cluster.quorum-type: auto
cluster.server-quorum-type: server
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
performance.strict-write-ordering: off
performance.stat-prefetch: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
features.shard-block-size: 64MB
cluster.background-self-heal-count: 16


Thanks,





--
Lindsay Mathieson
