Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-09-06 Thread Pranith Kumar Karampuri
 that may be why sweep on 2
>>>>>>>>>>> other bricks haven't replied as finished.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> my hope is that later tonight a full heal will work on
>>>>>>>>>>>> production.  Is it possible self-heal daemon can get stale or stop
>>>>>>>>>>>> listening but still show as active?  Would stopping and starting 
>>>>>>>>>>>> self-heal
>>>>>>>>>>>> daemon from gluster cli before doing these heals be helpful?
>>>>>>>>>>>>
>>>>>>>>>>>>
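
For reference, a minimal sketch of bouncing the self-heal daemon from the gluster CLI before re-issuing a full heal (the volume name GLUSTER1 is taken from the glustershd logs quoted further down, and toggling cluster.self-heal-daemon, an option shown in the volume info later in the thread, is an assumption about how the restart would be done):

    # Confirm a "Self-heal Daemon" entry (with a PID) is listed for every node
    gluster volume status GLUSTER1

    # Stop and restart the self-heal daemon for this volume
    gluster volume set GLUSTER1 cluster.self-heal-daemon off
    gluster volume set GLUSTER1 cluster.self-heal-daemon on

    # Re-issue the full heal and watch glustershd.log for a fresh "starting full sweep"
    gluster volume heal GLUSTER1 full
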
>>>>>>>>>>>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage <
>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage <
>>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay <
>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay <
>>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <
>>>>>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <
>>>>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Could you also share the glustershd logs?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'll get them when I get to work sure
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I tried the same steps that you mentioned multiple times,
>>>>>>>>>>>>>>>>>> but heal is running to completion without any issues.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It must be said that 'heal full' traverses the files and
>>>>>>>>>>>>>>>>>> directories in a depth-first order and does heals also in 
>>>>>>>>>>>>>>>>>> the same order.
>>>>>>>>>>>>>>>>>> But if it gets interrupted in the middle (say because 
>>>>>>>>>>>>>>>>>> self-heal-daemon was
>>>>>>>>>>>>>>>>>> either intentionally or unintentionally brought offline and 
>>>>>>>>>>>>>>>>>> then brought
>>>>>>>>>>>>>>>>>> back up), self-heal will only pick up the entries that are 
>>>>>>>>>>>>>>>>>> so far marked as
>>>>>>>>>>>>>>>>>> new-entries that need heal which it will find in 
>>>>>>>>>>>>>>>>>> indices/xattrop directory.

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-09-06 Thread David Gossage
>>>>>>>>>>>> my hope is that later tonight a full heal will work on
>>>>>>>>>>>> production.  Is it possible self-heal daemon can get stale or stop
>>>>>>>>>>>> listening but still show as active?  Would stopping and starting 
>>>>>>>>>>>> self-heal
>>>>>>>>>>>> daemon from gluster cli before doing these heals be helpful?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage <
>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage <
>>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay <
>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay <
>>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <
>>>>>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <
>>>>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Could you also share the glustershd logs?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'll get them when I get to work sure
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I tried the same steps that you mentioned multiple times,
>>>>>>>>>>>>>>>>>> but heal is running to completion without any issues.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It must be said that 'heal full' traverses the files and
>>>>>>>>>>>>>>>>>> directories in a depth-first order and does heals also in 
>>>>>>>>>>>>>>>>>> the same order.
>>>>>>>>>>>>>>>>>> But if it gets interrupted in the middle (say because 
>>>>>>>>>>>>>>>>>> self-heal-daemon was
>>>>>>>>>>>>>>>>>> either intentionally or unintentionally brought offline and 
>>>>>>>>>>>>>>>>>> then brought
>>>>>>>>>>>>>>>>>> back up), self-heal will only pick up the entries that are 
>>>>>>>>>>>>>>>>>> so far marked as
>>>>>>>>>>>>>>>>>> new-entries that need heal which it will find in 
>>>>>>>>>>>>>>>>>> indices/xattrop directory.
>>>>>>>>>>>>>>>>>> What this means is that those files and directories that 
>>>>>>>>>>>>>>>>>> were not visited
>>>>>>>>>>>>>>>>>> during the crawl, will remain untouched and unhealed in this 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-09-06 Thread Krutika Dhananjay
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll get them when I get to work sure
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried the same steps that you mentioned multiple times,
>>>>>>>>>>>>>>>>> but heal is running to completion without any issues.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It must be said that 'heal full' traverses the files and
>>>>>>>>>>>>>>>>> directories in a depth-first order and does heals also in the 
>>>>>>>>>>>>>>>>> same order.
>>>>>>>>>>>>>>>>> But if it gets interrupted in the middle (say because 
>>>>>>>>>>>>>>>>> self-heal-daemon was
>>>>>>>>>>>>>>>>> either intentionally or unintentionally brought offline and 
>>>>>>>>>>>>>>>>> then brought
>>>>>>>>>>>>>>>>> back up), self-heal will only pick up the entries that are so 
>>>>>>>>>>>>>>>>> far marked as
>>>>>>>>>>>>>>>>> new-entries that need heal which it will find in 
>>>>>>>>>>>>>>>>> indices/xattrop directory.
>>>>>>>>>>>>>>>>> What this means is that those files and directories that were 
>>>>>>>>>>>>>>>>> not visited
>>>>>>>>>>>>>>>>> during the crawl, will remain untouched and unhealed in this 
>>>>>>>>>>>>>>>>> second
>>>>>>>>>>>>>>>>> iteration of heal, unless you execute a 'heal-full' again.
>>>>>>>>>>>>>>>>>
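
In other words, an interrupted crawl only resumes from the indexed entries, so skipped files are picked up only by another full heal. A minimal sketch of re-triggering and checking the crawl, assuming the volume name GLUSTER1 from the logs below (the statistics output format varies between releases):

    # Start a fresh full crawl from '/' of the volume
    gluster volume heal GLUSTER1 full

    # Per-brick crawl statistics from the self-heal daemon;
    # look for a recent crawl reported with type FULL
    gluster volume heal GLUSTER1 statistics
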
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So should it start healing shards as it crawls or not until
>>>>>>>>>>>>>>>> after it crawls the entire .shard directory?  At the pace it 
>>>>>>>>>>>>>>>> was going that
>>>>>>>>>>>>>>>> could be a week with one node appearing in the cluster but 
>>>>>>>>>>>>>>>> with no shard
>>>>>>>>>>>>>>>> files if anything tries to access a file on that node.  From 
>>>>>>>>>>>>>>>> my experience
>>>>>>>>>>>>>>>> other day telling it to heal full again did nothing regardless 
>>>>>>>>>>>>>>>> of node used.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Crawl is started from '/' of the volume. Whenever self-heal
>>>>>>>>>>>>>> detects during the crawl that a file or directory is present in 
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> brick(s) and absent in others, it creates the file on the bricks 
>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>> is absent and marks the fact that the file or directory might 
>>>>>>>>>>>>>> need
>>>>>>>>>>>>>> data/entry and metadata heal too (this also means that an index 
>>>>>>>>>>>>>> is created
>>>>>>>>>>>>>> under .glusterfs/indices/xattrop of the src bricks). And the 
>>>>>>>>>>>>>> data/entry and
>>>>>>>>>>>>>> metadata heal are picked up and done in
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the backgro

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-09-06 Thread David Gossage
before doing these heals be helpful?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage <
>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage <
>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay <
>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay <
>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <
>>>>>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <
>>>>>>>>>>>>>>> kdhan...@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you also share the glustershd logs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll get them when I get to work sure
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried the same steps that you mentioned multiple times,
>>>>>>>>>>>>>>>> but heal is running to completion without any issues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It must be said that 'heal full' traverses the files and
>>>>>>>>>>>>>>>> directories in a depth-first order and does heals also in the 
>>>>>>>>>>>>>>>> same order.
>>>>>>>>>>>>>>>> But if it gets interrupted in the middle (say because 
>>>>>>>>>>>>>>>> self-heal-daemon was
>>>>>>>>>>>>>>>> either intentionally or unintentionally brought offline and 
>>>>>>>>>>>>>>>> then brought
>>>>>>>>>>>>>>>> back up), self-heal will only pick up the entries that are so 
>>>>>>>>>>>>>>>> far marked as
>>>>>>>>>>>>>>>> new-entries that need heal which it will find in 
>>>>>>>>>>>>>>>> indices/xattrop directory.
>>>>>>>>>>>>>>>> What this means is that those files and directories that were 
>>>>>>>>>>>>>>>> not visited
>>>>>>>>>>>>>>>> during the crawl, will remain untouched and unhealed in this 
>>>>>>>>>>>>>>>> second
>>>>>>>>>>>>>>>> iteration of heal, unless you execute a 'heal-full' again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So should it start healing shards as it crawls or not until
>>>>>>>>>>>>>>> after it crawls the entire .shard directory?  At the pace it 
>>>>>>>>>>>>>>> was going that
>>>>>>>>>>>>>>> could be a week with one node appearing in the cluster but with 
>>>>>>>>>>>>>>> no shard

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-09-01 Thread David Gossage
>>>>>>>>>>>>>>> during the crawl, will remain untouched and unhealed in this
>>>>>>>>>>>>>>> second
>>>>>>>>>>>>>>> iteration of heal, unless you execute a 'heal-full' again.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So should it start healing shards as it crawls or not until
>>>>>>>>>>>>>> after it crawls the entire .shard directory?  At the pace it was 
>>>>>>>>>>>>>> going that
>>>>>>>>>>>>>> could be a week with one node appearing in the cluster but with 
>>>>>>>>>>>>>> no shard
>>>>>>>>>>>>>> files if anything tries to access a file on that node.  From my 
>>>>>>>>>>>>>> experience
>>>>>>>>>>>>>> other day telling it to heal full again did nothing regardless 
>>>>>>>>>>>>>> of node used.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> Crawl is started from '/' of the volume. Whenever self-heal
>>>>>>>>>>>> detects during the crawl that a file or directory is present in 
>>>>>>>>>>>> some
>>>>>>>>>>>> brick(s) and absent in others, it creates the file on the bricks 
>>>>>>>>>>>> where it
>>>>>>>>>>>> is absent and marks the fact that the file or directory might need
>>>>>>>>>>>> data/entry and metadata heal too (this also means that an index is 
>>>>>>>>>>>> created
>>>>>>>>>>>> under .glusterfs/indices/xattrop of the src bricks). And the 
>>>>>>>>>>>> data/entry and
>>>>>>>>>>>> metadata heal are picked up and done in
>>>>>>>>>>>>
>>>>>>>>>>> the background with the help of these indices.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Looking at my 3rd node as example i find nearly an exact same
>>>>>>>>>>> number of files in xattrop dir as reported by heal count at time I 
>>>>>>>>>>> brought
>>>>>>>>>>> down node2 to try and alleviate read io errors that seemed to occur 
>>>>>>>>>>> from
>>>>>>>>>>> what I was guessing as attempts to use the node with no shards for 
>>>>>>>>>>> reads.
>>>>>>>>>>>
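
A rough sketch of that comparison, run on each node, assuming the production brick path /gluster1/BRICK1/1 quoted later in the thread:

    # Pending-heal index entries recorded on this brick
    ls /gluster1/BRICK1/1/.glusterfs/indices/xattrop | wc -l

    # Per-brick totals reported by gluster itself
    gluster volume heal GLUSTER1 info | grep 'Number of entries'
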
>>>>>>>>>>> Also attached are the glustershd logs from the 3 nodes, along
>>>>>>>>>>> with the test node i tried yesterday with same results.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking at my own logs I notice that a full sweep was only ever
>>>>>>>>>> recorded in glustershd.log on 2nd node with missing directory.  I 
>>>>>>>>>> believe I
>>>>>>>>>> should have found a sweep begun on every node correct?
>>>>>>>>>>
>>>>>>>>>> On my test dev when it did work I do see that
>>>>>>>>>>
>>>>>>>>>> [2016-08-30 13:56:25.22] I [MSGID: 108026]
>>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>>> glustershard-client-0
>>>>>>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>>> glustershard-client-1
>>>>>>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol glustershard-client-2

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread Krutika Dhananjay
>>>>>>>>>> Looking at my 3rd node as example i find nearly an exact same
>>>>>>>>>> number of files in xattrop dir as reported by heal count at time I 
>>>>>>>>>> brought
>>>>>>>>>> down node2 to try and alleviate read io errors that seemed to occur 
>>>>>>>>>> from
>>>>>>>>>> what I was guessing as attempts to use the node with no shards for 
>>>>>>>>>> reads.
>>>>>>>>>>
>>>>>>>>>> Also attached are the glustershd logs from the 3 nodes, along
>>>>>>>>>> with the test node i tried yesterday with same results.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at my own logs I notice that a full sweep was only ever
>>>>>>>>> recorded in glustershd.log on 2nd node with missing directory.  I 
>>>>>>>>> believe I
>>>>>>>>> should have found a sweep begun on every node correct?
>>>>>>>>>
>>>>>>>>> On my test dev when it did work I do see that
>>>>>>>>>
>>>>>>>>> [2016-08-30 13:56:25.22] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-0
>>>>>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-1
>>>>>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-2
>>>>>>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>>>> glustershard-client-2
>>>>>>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>>>> glustershard-client-1
>>>>>>>>> [2016-08-30 14:18:49.637811] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>>>> glustershard-client-0
>>>>>>>>>
>>>>>>>>> While when looking at past few days of the 3 prod nodes i only
>>>>>>>>> found that on my 2nd node
>>>>>>>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-GLUSTER1-replicate-0: starting full sweep on subvol 
>>>>>>>>> GLUSTER1-client-1
>>>>>>>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-GLUSTER1-replicate-0: finished full sweep on subvol 
>>>>>>>>> GLUSTER1-client-1
>>>>>>>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-GLUSTER1-replicate-0: starting full sweep on subvol 
>>>>>>>>> GLUSTER1-client-1
>>>>>>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-GLUSTER1-replicate-0: finished full sweep on subvol 
>>>>>>>>> GLUSTER1-client-1
>>>>>>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-GLUSTER1-replicate-0: starting full sweep on subvol GLUSTER1-client-1

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread David Gossage
>>>>>>>>>>>>>
>>>>>>>>>>>>> So should it start healing shards as it crawls or not until
>>>>>>>>>>>>> after it crawls the entire .shard directory?  At the pace it was 
>>>>>>>>>>>>> going that
>>>>>>>>>>>>> could be a week with one node appearing in the cluster but with 
>>>>>>>>>>>>> no shard
>>>>>>>>>>>>> files if anything tries to access a file on that node.  From my 
>>>>>>>>>>>>> experience
>>>>>>>>>>>>> other day telling it to heal full again did nothing regardless of 
>>>>>>>>>>>>> node used.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Crawl is started from '/' of the volume. Whenever self-heal
>>>>>>>>>>> detects during the crawl that a file or directory is present in some
>>>>>>>>>>> brick(s) and absent in others, it creates the file on the bricks 
>>>>>>>>>>> where it
>>>>>>>>>>> is absent and marks the fact that the file or directory might need
>>>>>>>>>>> data/entry and metadata heal too (this also means that an index is 
>>>>>>>>>>> created
>>>>>>>>>>> under .glusterfs/indices/xattrop of the src bricks). And the 
>>>>>>>>>>> data/entry and
>>>>>>>>>>> metadata heal are picked up and done in
>>>>>>>>>>>
>>>>>>>>>> the background with the help of these indices.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking at my 3rd node as example i find nearly an exact same
>>>>>>>>>> number of files in xattrop dir as reported by heal count at time I 
>>>>>>>>>> brought
>>>>>>>>>> down node2 to try and alleviate read io errors that seemed to occur 
>>>>>>>>>> from
>>>>>>>>>> what I was guessing as attempts to use the node with no shards for 
>>>>>>>>>> reads.
>>>>>>>>>>
>>>>>>>>>> Also attached are the glustershd logs from the 3 nodes, along
>>>>>>>>>> with the test node i tried yesterday with same results.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at my own logs I notice that a full sweep was only ever
>>>>>>>>> recorded in glustershd.log on 2nd node with missing directory.  I 
>>>>>>>>> believe I
>>>>>>>>> should have found a sweep begun on every node correct?
>>>>>>>>>
>>>>>>>>> On my test dev when it did work I do see that
>>>>>>>>>
>>>>>>>>> [2016-08-30 13:56:25.22] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-0
>>>>>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-1
>>>>>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>>> glustershard-client-2
>>>>>>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>>>> glustershard-client-2
>>>>>>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026]
>>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread David Gossage
>>>>>>>>>>>>> during the crawl, will remain untouched and unhealed in this
>>>>>>>>>>>>> second
>>>>>>>>>>>>> iteration of heal, unless you execute a 'heal-full' again.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> So should it start healing shards as it crawls or not until
>>>>>>>>>>>> after it crawls the entire .shard directory?  At the pace it was 
>>>>>>>>>>>> going that
>>>>>>>>>>>> could be a week with one node appearing in the cluster but with no 
>>>>>>>>>>>> shard
>>>>>>>>>>>> files if anything tries to access a file on that node.  From my 
>>>>>>>>>>>> experience
>>>>>>>>>>>> other day telling it to heal full again did nothing regardless of 
>>>>>>>>>>>> node used.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> Crawl is started from '/' of the volume. Whenever self-heal
>>>>>>>>>> detects during the crawl that a file or directory is present in some
>>>>>>>>>> brick(s) and absent in others, it creates the file on the bricks 
>>>>>>>>>> where it
>>>>>>>>>> is absent and marks the fact that the file or directory might need
>>>>>>>>>> data/entry and metadata heal too (this also means that an index is 
>>>>>>>>>> created
>>>>>>>>>> under .glusterfs/indices/xattrop of the src bricks). And the 
>>>>>>>>>> data/entry and
>>>>>>>>>> metadata heal are picked up and done in
>>>>>>>>>>
>>>>>>>>> the background with the help of these indices.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at my 3rd node as example i find nearly an exact same
>>>>>>>>> number of files in xattrop dir as reported by heal count at time I 
>>>>>>>>> brought
>>>>>>>>> down node2 to try and alleviate read io errors that seemed to occur 
>>>>>>>>> from
>>>>>>>>> what I was guessing as attempts to use the node with no shards for 
>>>>>>>>> reads.
>>>>>>>>>
>>>>>>>>> Also attached are the glustershd logs from the 3 nodes, along with
>>>>>>>>> the test node i tried yesterday with same results.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at my own logs I notice that a full sweep was only ever
>>>>>>>> recorded in glustershd.log on 2nd node with missing directory.  I 
>>>>>>>> believe I
>>>>>>>> should have found a sweep begun on every node correct?
>>>>>>>>
>>>>>>>> On my test dev when it did work I do see that
>>>>>>>>
>>>>>>>> [2016-08-30 13:56:25.22] I [MSGID: 108026]
>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>> glustershard-client-0
>>>>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>> glustershard-client-1
>>>>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>>> glustershard-client-2
>>>>>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026]
>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>>> glustershard-client-2
>>>>>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026]
>>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread David Gossage
>>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026]
>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026]
>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026]
>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 21:00:42.603315] I [MSGID: 108026]
>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026]
>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>
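
One way to check each node for those sweep records, assuming the default glustershd log location:

    # Every full crawl should log a matching starting/finished pair per subvolume
    grep -E 'starting full sweep|finished full sweep' /var/log/glusterfs/glustershd.log | tail -n 20
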
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> My suspicion is that this is what happened on your setup. Could
>>>>>>>>>> you confirm if that was the case?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Brick was brought online with force start then a full heal
>>>>>>>>> launched.  Hours later after it became evident that it was not adding 
>>>>>>>>> new
>>>>>>>>> files to heal I did try restarting self-heal daemon and relaunching 
>>>>>>>>> full
>>>>>>>>> heal again. But this was after the heal had basically already failed 
>>>>>>>>> to
>>>>>>>>> work as intended.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK. How did you figure it was not adding any new files? I need to
>>>>>>>> know what places you were monitoring to come to this conclusion.
>>>>>>>>
>>>>>>>> -Krutika
>>>>>>>>
>>>>>>>>
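
For what it's worth, a simple sketch of watching whether new entries are actually being queued and healed over time (volume and brick names as elsewhere in the thread; the interval is arbitrary):

    # Pending-heal totals per brick, refreshed every minute
    watch -n 60 "gluster volume heal GLUSTER1 info | grep 'Number of entries'"

    # Shard count on the brick being repopulated, to confirm files are being recreated
    ls /gluster1/BRICK1/1/.shard | wc -l
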
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> As for those logs, I did manage to do something that caused
>>>>>>>>>> these warning messages you shared earlier to appear in my client and 
>>>>>>>>>> server
>>>>>>>>>> logs.
>>>>>>>>>> Although these logs are annoying and a bit scary too, they didn't
>>>>>>>>>> do any harm to the data in my volume. Why they appear just after a 
>>>>>>>>>> brick is
>>>>>>>>>> replaced and under no other circumstances is something I'm still
>>>>>>>>>> investigating.
>>>>>>>>>>
>>>>>>>>>> But for future, it would be good to follow the steps Anuradha
>>>>>>>>>> gave as that would allow self-heal to at least detect that it has 
>>>>>>>>>> some
>>>>>>>>>> repairing to do whenever it is restarted whether intentionally or 
>>>>>>>>>> otherwise.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I followed those steps as described on my test box and ended up
>>>>>>>>> with exact same outcome of adding shards at an agonizing slow pace 
>>>>>>>>> and no
>>>>>>>>> creation of .shard directory or heals on shard directory.  Directories
>>>>>>>>> visible from mount healed quickly.  This was with one VM so it has 
>>>>>>>>> only 800
>>>>>>>>> shards as well.  After hours at work it had added a total of 33 
>>>>>>>>> shards to
>>>>>>>>> be healed.  I sent those logs yesterday as well though not the 
>>>>>>>>> glustershd.
>>>>>>>>>
>>>>>>>>> Does replace-brick command copy files in same manner?  For these
>>>>>>>>> purposes I am contemplating just skipping the heal route.
>>>>>>
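
For the record, a sketch of the replace-brick route being weighed up, with a hypothetical new brick path; in 3.8 only the 'commit force' form is available, and as far as I understand the new brick is then populated by the same self-heal mechanism rather than a separate copy, so it would not bypass healing:

    # Swap the brick out (the new path is a placeholder)
    gluster volume replace-brick GLUSTER1 \
        ccgl2.gl.local:/gluster1/BRICK1/1 \
        ccgl2.gl.local:/gluster1/BRICK1/new \
        commit force

    # The replacement brick is then filled by self-heal; monitor as usual
    gluster volume heal GLUSTER1 info | grep 'Number of entries'
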

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread Krutika Dhananjay
>>>>>>> should have found a sweep begun on every node correct?
>>>>>>>
>>>>>>> On my test dev when it did work I do see that
>>>>>>>
>>>>>>> [2016-08-30 13:56:25.22] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>> glustershard-client-0
>>>>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>> glustershard-client-1
>>>>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: starting full sweep on subvol
>>>>>>> glustershard-client-2
>>>>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>> glustershard-client-2
>>>>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>> glustershard-client-1
>>>>>>> [2016-08-30 14:18:49.637811] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer]
>>>>>>> 0-glustershard-replicate-0: finished full sweep on subvol
>>>>>>> glustershard-client-0
>>>>>>>
>>>>>>> While when looking at past few days of the 3 prod nodes i only found
>>>>>>> that on my 2nd node
>>>>>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 21:00:42.603315] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026]
>>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> My suspicion is that this is what happened on your setup. Could
>>>>>>>>>>>> you confirm if that was the case?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Brick was brought online with force start then a full heal
>>>>>>>>>>> launched.  Hours later after it became evident that it was not 
>>>>>>>>>>> adding new
>>>>>>>>>>> files to heal I did try

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread Krutika Dhananjay
>>>>>> While when looking at past few days of the 3 prod nodes i only found
>>>>>> that on my 2nd node
>>>>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 21:00:42.603315] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> starting full sweep on subvol GLUSTER1-client-1
>>>>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026]
>>>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> My suspicion is that this is what happened on your setup. Could
>>>>>>>>>>> you confirm if that was the case?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Brick was brought online with force start then a full heal
>>>>>>>>>> launched.  Hours later after it became evident that it was not 
>>>>>>>>>> adding new
>>>>>>>>>> files to heal I did try restarting self-heal daemon and relaunching 
>>>>>>>>>> full
>>>>>>>>>> heal again. But this was after the heal had basically already failed 
>>>>>>>>>> to
>>>>>>>>>> work as intended.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> OK. How did you figure it was not adding any new files? I need to
>>>>>>>>> know what places you were monitoring to come to this conclusion.
>>>>>>>>>
>>>>>>>>> -Krutika
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> As for those logs, I did manage to do something that caused
>>>>>>>>>>> these warning messages you shared earlier to appear in my client 
>>>>>>>>>>> and server
>>>>>>>>>>> logs.
>>>>>>>>>>> Although these logs are annoying and a bit scary too, they
>>>>>>>>>>> didn't do any harm to the data in my volume. Why they appear just 
>>>>>>>>>>> after a
>>>>>>>>>>> brick is replaced and under no other circumstances is something I'm 
>>>>>>>>>>> still
>>>>>>>>>>> investigating.
>>>>>>>>>>>
>>>>>>>>>>> But for future, it would be good to follow the steps Anuradha
>>>>>>>>>>> gave as that would allow self-heal to at least detect that it has 
>>>>>>>>>>> some
>>>>>>>>>>> repairing to do whenever it is restarted whether intentionally or 
>>>>>>>>>>> otherwise.
>>>>>>

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-31 Thread Krutika Dhananjay
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> My suspicion is that this is what happened on your setup. Could
>>>>>>>>>> you confirm if that was the case?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Brick was brought online with force start then a full heal
>>>>>>>>> launched.  Hours later after it became evident that it was not adding 
>>>>>>>>> new
>>>>>>>>> files to heal I did try restarting self-heal daemon and relaunching 
>>>>>>>>> full
>>>>>>>>> heal again. But this was after the heal had basically already failed 
>>>>>>>>> to
>>>>>>>>> work as intended.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK. How did you figure it was not adding any new files? I need to
>>>>>>>> know what places you were monitoring to come to this conclusion.
>>>>>>>>
>>>>>>>> -Krutika
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> As for those logs, I did manage to do something that caused
>>>>>>>>>> these warning messages you shared earlier to appear in my client and 
>>>>>>>>>> server
>>>>>>>>>> logs.
>>>>>>>>>> Although these logs are annoying and a bit scary too, they didn't
>>>>>>>>>> do any harm to the data in my volume. Why they appear just after a 
>>>>>>>>>> brick is
>>>>>>>>>> replaced and under no other circumstances is something I'm still
>>>>>>>>>> investigating.
>>>>>>>>>>
>>>>>>>>>> But for future, it would be good to follow the steps Anuradha
>>>>>>>>>> gave as that would allow self-heal to at least detect that it has 
>>>>>>>>>> some
>>>>>>>>>> repairing to do whenever it is restarted whether intentionally or 
>>>>>>>>>> otherwise.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I followed those steps as described on my test box and ended up
>>>>>>>>> with exact same outcome of adding shards at an agonizing slow pace 
>>>>>>>>> and no
>>>>>>>>> creation of .shard directory or heals on shard directory.  Directories
>>>>>>>>> visible from mount healed quickly.  This was with one VM so it has 
>>>>>>>>> only 800
>>>>>>>>> shards as well.  After hours at work it had added a total of 33 
>>>>>>>>> shards to
>>>>>>>>> be healed.  I sent those logs yesterday as well though not the 
>>>>>>>>> glustershd.
>>>>>>>>>
>>>>>>>>> Does replace-brick command copy files in same manner?  For these
>>>>>>>>> purposes I am contemplating just skipping the heal route.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -Krutika
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> attached brick and client logs from test machine where same
>>>>>>>>>>> behavior occurred not sure if anything new is there.  its still on 
>>>>>>>>>>> 3.8.2
>>>>>>>>>>>
>>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>> Bricks:
>>>>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread Krutika Dhananjay
>>>>>>>> launched.  Hours later after it became evident that it was not adding new
>>>>>>>> files to heal I did try restarting self-heal daemon and relaunching 
>>>>>>>> full
>>>>>>>> heal again. But this was after the heal had basically already failed to
>>>>>>>> work as intended.
>>>>>>>>
>>>>>>>
>>>>>>> OK. How did you figure it was not adding any new files? I need to
>>>>>>> know what places you were monitoring to come to this conclusion.
>>>>>>>
>>>>>>> -Krutika
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> As for those logs, I did manage to do something that caused these
>>>>>>>>> warning messages you shared earlier to appear in my client and server 
>>>>>>>>> logs.
>>>>>>>>> Although these logs are annoying and a bit scary too, they didn't
>>>>>>>>> do any harm to the data in my volume. Why they appear just after a 
>>>>>>>>> brick is
>>>>>>>>> replaced and under no other circumstances is something I'm still
>>>>>>>>> investigating.
>>>>>>>>>
>>>>>>>>> But for future, it would be good to follow the steps Anuradha gave
>>>>>>>>> as that would allow self-heal to at least detect that it has some 
>>>>>>>>> repairing
>>>>>>>>> to do whenever it is restarted whether intentionally or otherwise.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I followed those steps as described on my test box and ended up
>>>>>>>> with exact same outcome of adding shards at an agonizing slow pace and 
>>>>>>>> no
>>>>>>>> creation of .shard directory or heals on shard directory.  Directories
>>>>>>>> visible from mount healed quickly.  This was with one VM so it has 
>>>>>>>> only 800
>>>>>>>> shards as well.  After hours at work it had added a total of 33 shards 
>>>>>>>> to
>>>>>>>> be healed.  I sent those logs yesterday as well though not the 
>>>>>>>> glustershd.
>>>>>>>>
>>>>>>>> Does replace-brick command copy files in same manner?  For these
>>>>>>>> purposes I am contemplating just skipping the heal route.
>>>>>>>>
>>>>>>>>
>>>>>>>>> -Krutika
>>>>>>>>>
>>>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>>>
>>>>>>>>>> attached brick and client logs from test machine where same
>>>>>>>>>> behavior occurred not sure if anything new is there.  its still on 
>>>>>>>>>> 3.8.2
>>>>>>>>>>
>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> cluster.locking-scheme: granular
>>>>>>>>>> performance.strict-o-direct: off
>>>>>>>>>> features.shard-block-size: 64MB
>>>>>>>>>> features.shard: on
>>>>>>>>>> server.allow-insecure: on
>>>>>>>>>> storage.owner-uid: 36
>>>>>>>>>> storage.owner-gid: 36
>>>>>>>>>> cluster.server-quorum-type: server
>>>>>>>>>> cluster.quorum-type: auto
>>>>>>>>>> network.remote-dio: on
>>>>>>>>>> cluster.eager-lock: enable
>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>> performance.io-cache: off
>>>>>>>>>> performance.quick-read: off
>>>>>>>>>> cluster.self-heal-window-size: 1024

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
>> [2016-08-27 20:03:44.278274] I [MSGID: 108026]
>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>> finished full sweep on subvol GLUSTER1-client-1
>> [2016-08-27 21:00:42.603315] I [MSGID: 108026]
>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>> starting full sweep on subvol GLUSTER1-client-1
>> [2016-08-27 21:00:46.148674] I [MSGID: 108026]
>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>> finished full sweep on subvol GLUSTER1-client-1
>>
>>
>>
>>
>>
>>>
>>>>
>>>>>>
>>>>>>> My suspicion is that this is what happened on your setup. Could you
>>>>>>> confirm if that was the case?
>>>>>>>
>>>>>>
>>>>>> Brick was brought online with force start then a full heal launched.
>>>>>> Hours later after it became evident that it was not adding new files to
>>>>>> heal I did try restarting self-heal daemon and relaunching full heal 
>>>>>> again.
>>>>>> But this was after the heal had basically already failed to work as
>>>>>> intended.
>>>>>>
>>>>>
>>>>> OK. How did you figure it was not adding any new files? I need to know
>>>>> what places you were monitoring to come to this conclusion.
>>>>>
>>>>> -Krutika
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> As for those logs, I did manage to do something that caused these
>>>>>>> warning messages you shared earlier to appear in my client and server 
>>>>>>> logs.
>>>>>>> Although these logs are annoying and a bit scary too, they didn't do
>>>>>>> any harm to the data in my volume. Why they appear just after a brick is
>>>>>>> replaced and under no other circumstances is something I'm still
>>>>>>> investigating.
>>>>>>>
>>>>>>> But for future, it would be good to follow the steps Anuradha gave
>>>>>>> as that would allow self-heal to at least detect that it has some 
>>>>>>> repairing
>>>>>>> to do whenever it is restarted whether intentionally or otherwise.
>>>>>>>
>>>>>>
>>>>>> I followed those steps as described on my test box and ended up with
>>>>>> exact same outcome of adding shards at an agonizing slow pace and no
>>>>>> creation of .shard directory or heals on shard directory.  Directories
>>>>>> visible from mount healed quickly.  This was with one VM so it has only 
>>>>>> 800
>>>>>> shards as well.  After hours at work it had added a total of 33 shards to
>>>>>> be healed.  I sent those logs yesterday as well though not the 
>>>>>> glustershd.
>>>>>>
>>>>>> Does replace-brick command copy files in same manner?  For these
>>>>>> purposes I am contemplating just skipping the heal route.
>>>>>>
>>>>>>
>>>>>>> -Krutika
>>>>>>>
>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>
>>>>>>>> attached brick and client logs from test machine where same
>>>>>>>> behavior occurred not sure if anything new is there.  its still on 
>>>>>>>> 3.8.2
>>>>>>>>
>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>>>> Options Reconfigured:
>>>>>>>> cluster.locking-scheme: granular
>>>>>>>> performance.strict-o-direct: off
>>>>>>>> features.shard-block-size: 64MB
>>>>>>>> features.shard: on
>>>>>>>> server.allow-insecure: on
>>>>>>>> storage.owner-uid: 36
>>>>>>>> storage.owner-gid: 36
>>>>>>>> cluster.server-quorum-type: server
>>>>>>>> cluster.quorum-type: auto

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
>>>> again.
>>>>> But this was after the heal had basically already failed to work as
>>>>> intended.
>>>>>
>>>>
>>>> OK. How did you figure it was not adding any new files? I need to know
>>>> what places you were monitoring to come to this conclusion.
>>>>
>>>> -Krutika
>>>>
>>>>
>>>>>
>>>>>
>>>>>> As for those logs, I did manage to do something that caused these
>>>>>> warning messages you shared earlier to appear in my client and server 
>>>>>> logs.
>>>>>> Although these logs are annoying and a bit scary too, they didn't do
>>>>>> any harm to the data in my volume. Why they appear just after a brick is
>>>>>> replaced and under no other circumstances is something I'm still
>>>>>> investigating.
>>>>>>
>>>>>> But for future, it would be good to follow the steps Anuradha gave as
>>>>>> that would allow self-heal to at least detect that it has some repairing 
>>>>>> to
>>>>>> do whenever it is restarted whether intentionally or otherwise.
>>>>>>
>>>>>
>>>>> I followed those steps as described on my test box and ended up with
>>>>> exact same outcome of adding shards at an agonizing slow pace and no
>>>>> creation of .shard directory or heals on shard directory.  Directories
>>>>> visible from mount healed quickly.  This was with one VM so it has only 
>>>>> 800
>>>>> shards as well.  After hours at work it had added a total of 33 shards to
>>>>> be healed.  I sent those logs yesterday as well though not the glustershd.
>>>>>
>>>>> Does replace-brick command copy files in same manner?  For these
>>>>> purposes I am contemplating just skipping the heal route.
>>>>>
>>>>>
>>>>>> -Krutika
>>>>>>
>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>
>>>>>>> attached brick and client logs from test machine where same behavior
>>>>>>> occurred not sure if anything new is there.  its still on 3.8.2
>>>>>>>
>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>>> Options Reconfigured:
>>>>>>> cluster.locking-scheme: granular
>>>>>>> performance.strict-o-direct: off
>>>>>>> features.shard-block-size: 64MB
>>>>>>> features.shard: on
>>>>>>> server.allow-insecure: on
>>>>>>> storage.owner-uid: 36
>>>>>>> storage.owner-gid: 36
>>>>>>> cluster.server-quorum-type: server
>>>>>>> cluster.quorum-type: auto
>>>>>>> network.remote-dio: on
>>>>>>> cluster.eager-lock: enable
>>>>>>> performance.stat-prefetch: off
>>>>>>> performance.io-cache: off
>>>>>>> performance.quick-read: off
>>>>>>> cluster.self-heal-window-size: 1024
>>>>>>> cluster.background-self-heal-count: 16
>>>>>>> nfs.enable-ino32: off
>>>>>>> nfs.addr-namelookup: off
>>>>>>> nfs.disable: on
>>>>>>> performance.read-ahead: off
>>>>>>> performance.readdir-ahead: on
>>>>>>> cluster.granular-entry-heal: on
>>>>>>>
>>>>>>>
>>>>>>>
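
As an aside, a sketch of how options like the ones listed above are applied and reviewed from the CLI (the test volume name "glustershard" is assumed from the 0-glustershard prefix in the logs):

    # Options are set per volume
    gluster volume set glustershard cluster.locking-scheme granular
    gluster volume set glustershard cluster.granular-entry-heal on

    # "Options Reconfigured" in the volume info shows what has been changed from defaults
    gluster volume info glustershard
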
>>>>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Original Message -
>>>>>>>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
[2016-08-27 11:37:01.732366] I [MSGID: 108026]
[afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished
full sweep on subvol GLUSTER1-client-1
[2016-08-27 12:58:34.597228] I [MSGID: 108026]
[afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting
full sweep on subvol GLUSTER1-client-1
[2016-08-27 12:59:28.041173] I [MSGID: 108026]
[afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished
full sweep on subvol GLUSTER1-client-1
[2016-08-27 20:03:42.560188] I [MSGID: 108026]
[afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting
full sweep on subvol GLUSTER1-client-1
[2016-08-27 20:03:44.278274] I [MSGID: 108026]
[afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished
full sweep on subvol GLUSTER1-client-1
[2016-08-27 21:00:42.603315] I [MSGID: 108026]
[afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0: starting
full sweep on subvol GLUSTER1-client-1
[2016-08-27 21:00:46.148674] I [MSGID: 108026]
[afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0: finished
full sweep on subvol GLUSTER1-client-1





>
>>
>>>>
>>>>> My suspicion is that this is what happened on your setup. Could you
>>>>> confirm if that was the case?
>>>>>
>>>>
>>>> Brick was brought online with force start then a full heal launched.
>>>> Hours later after it became evident that it was not adding new files to
>>>> heal I did try restarting self-heal daemon and relaunching full heal again.
>>>> But this was after the heal had basically already failed to work as
>>>> intended.
>>>>
>>>
>>> OK. How did you figure it was not adding any new files? I need to know
>>> what places you were monitoring to come to this conclusion.
>>>
>>> -Krutika
>>>
>>>
>>>>
>>>>
>>>>> As for those logs, I did manage to do something that caused these
>>>>> warning messages you shared earlier to appear in my client and server 
>>>>> logs.
>>>>> Although these logs are annoying and a bit scary too, they didn't do
>>>>> any harm to the data in my volume. Why they appear just after a brick is
>>>>> replaced and under no other circumstances is something I'm still
>>>>> investigating.
>>>>>
>>>>> But for future, it would be good to follow the steps Anuradha gave as
>>>>> that would allow self-heal to at least detect that it has some repairing 
>>>>> to
>>>>> do whenever it is restarted whether intentionally or otherwise.
>>>>>
>>>>
>>>> I followed those steps as described on my test box and ended up with
>>>> exact same outcome of adding shards at an agonizing slow pace and no
>>>> creation of .shard directory or heals on shard directory.  Directories
>>>> visible from mount healed quickly.  This was with one VM so it has only 800
>>>> shards as well.  After hours at work it had added a total of 33 shards to
>>>> be healed.  I sent those logs yesterday as well though not the glustershd.
>>>>
>>>> Does replace-brick command copy files in same manner?  For these
>>>> purposes I am contemplating just skipping the heal route.
>>>>
>>>>
>>>>> -Krutika
>>>>>
>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>
>>>>>> attached brick and client logs from test machine where same behavior
>>>>>> occurred not sure if anything new is there.  its still on 3.8.2
>>>>>>
>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>> Options Reconfigured:
>>>>>> cluster.locking-scheme: granular
>>>>>> performance.strict-o-direct: off
>>>>>> features.shard-block-size: 64MB
>>>>>> features.shard: on
>>>>>> server.allow-insecure: on
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> network.remote-dio: on
>>>>>> cluster.eager-lock: enable
>>>>>> performance.stat-prefetch: off

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
>>> OK. How did you figure it was not adding any new files? I need to know
>>> what places you were monitoring to come to this conclusion.
>>>
>>> -Krutika
>>>
>>>
>>>>
>>>>
>>>>> As for those logs, I did manage to do something that caused these
>>>>> warning messages you shared earlier to appear in my client and server 
>>>>> logs.
>>>>> Although these logs are annoying and a bit scary too, they didn't do
>>>>> any harm to the data in my volume. Why they appear just after a brick is
>>>>> replaced and under no other circumstances is something I'm still
>>>>> investigating.
>>>>>
>>>>> But for future, it would be good to follow the steps Anuradha gave as
>>>>> that would allow self-heal to at least detect that it has some repairing 
>>>>> to
>>>>> do whenever it is restarted whether intentionally or otherwise.
>>>>>
>>>>
>>>> I followed those steps as described on my test box and ended up with
>>>> exact same outcome of adding shards at an agonizing slow pace and no
>>>> creation of .shard directory or heals on shard directory.  Directories
>>>> visible from mount healed quickly.  This was with one VM so it has only 800
>>>> shards as well.  After hours at work it had added a total of 33 shards to
>>>> be healed.  I sent those logs yesterday as well though not the glustershd.
>>>>
>>>> Does replace-brick command copy files in same manner?  For these
>>>> purposes I am contemplating just skipping the heal route.
>>>>
>>>>
>>>>> -Krutika
>>>>>
>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>
>>>>>> attached brick and client logs from test machine where same behavior
>>>>>> occurred not sure if anything new is there.  its still on 3.8.2
>>>>>>
>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>> Options Reconfigured:
>>>>>> cluster.locking-scheme: granular
>>>>>> performance.strict-o-direct: off
>>>>>> features.shard-block-size: 64MB
>>>>>> features.shard: on
>>>>>> server.allow-insecure: on
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> network.remote-dio: on
>>>>>> cluster.eager-lock: enable
>>>>>> performance.stat-prefetch: off
>>>>>> performance.io-cache: off
>>>>>> performance.quick-read: off
>>>>>> cluster.self-heal-window-size: 1024
>>>>>> cluster.background-self-heal-count: 16
>>>>>> nfs.enable-ino32: off
>>>>>> nfs.addr-namelookup: off
>>>>>> nfs.disable: on
>>>>>> performance.read-ahead: off
>>>>>> performance.readdir-ahead: on
>>>>>> cluster.granular-entry-heal: on
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>>>>>> dgoss...@carouselchecks.com> wrote:
>>>>>>
>>>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> - Original Message -
>>>>>>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>>>> > To: "Anuradha Talur" <ata...@redhat.com>
>>>>>>>> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
>>>>>>>> "Krutika Dhananjay" <kdhan...@redhat.com>
>>>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>>>> >
>>>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com> wrote:

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>> Options Reconfigured:
>>>> cluster.locking-scheme: granular
>>>> performance.strict-o-direct: off
>>>> features.shard-block-size: 64MB
>>>> features.shard: on
>>>> server.allow-insecure: on
>>>> storage.owner-uid: 36
>>>> storage.owner-gid: 36
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: on
>>>> cluster.eager-lock: enable
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.quick-read: off
>>>> cluster.self-heal-window-size: 1024
>>>> cluster.background-self-heal-count: 16
>>>> nfs.enable-ino32: off
>>>> nfs.addr-namelookup: off
>>>> nfs.disable: on
>>>> performance.read-ahead: off
>>>> performance.readdir-ahead: on
>>>> cluster.granular-entry-heal: on
>>>>
>>>>
>>>>
>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>>>> dgoss...@carouselchecks.com> wrote:
>>>>
>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> - Original Message -
>>>>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>> > To: "Anuradha Talur" <ata...@redhat.com>
>>>>>> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
>>>>>> "Krutika Dhananjay" <kdhan...@redhat.com>
>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>> >
>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > > Response inline.
>>>>>> > >
>>>>>> > > - Original Message -
>>>>>> > > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
>>>>>> > > > To: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>> > > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org
>>>>>> >
>>>>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM
>>>>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>> > > >
>>>>>> > > > Could you attach both client and brick logs? Meanwhile I will
>>>>>> try these
>>>>>> > > steps
>>>>>> > > > out on my machines and see if it is easily recreatable.
>>>>>> > > >
>>>>>> > > > -Krutika
>>>>>> > > >
>>>>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>>>>> > > dgoss...@carouselchecks.com
>>>>>> > > > > wrote:
>>>>>> > > >
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > Centos 7 Gluster 3.8.3
>>>>>> > > >
>>>>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Options Reconfigured:
>>>>>> > > > cluster.data-self-heal-algorithm: full
>>>>>> > > > cluster.self-heal-daemon: on
>>>>>> > > > cluster.locking-scheme: granular
>>>>>> > > > features.shard-block-size: 64MB
>>>>>> > > > features.shard: on
>>>>>> > > > performance.readdir-ahead: on
>>>>>> > > > storage.owner-uid: 36
>>>>>> > > > storage.owner-gid: 36
>>>>>> > > > performance.quick-read: off
>>>>>> > > > performance.read-ahead: off
>>>>>> > > > performance.io-cache: off
>>>>>> > > > performance.stat-prefetch: on
>>>>>> > > > cluster.eager-lock: enable
>>>>>> > > > 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread Krutika Dhananjay
>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>> Options Reconfigured:
>>>> cluster.locking-scheme: granular
>>>> performance.strict-o-direct: off
>>>> features.shard-block-size: 64MB
>>>> features.shard: on
>>>> server.allow-insecure: on
>>>> storage.owner-uid: 36
>>>> storage.owner-gid: 36
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: on
>>>> cluster.eager-lock: enable
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.quick-read: off
>>>> cluster.self-heal-window-size: 1024
>>>> cluster.background-self-heal-count: 16
>>>> nfs.enable-ino32: off
>>>> nfs.addr-namelookup: off
>>>> nfs.disable: on
>>>> performance.read-ahead: off
>>>> performance.readdir-ahead: on
>>>> cluster.granular-entry-heal: on
>>>>
>>>>
>>>>
>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>>>> dgoss...@carouselchecks.com> wrote:
>>>>
>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> - Original Message -
>>>>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>> > To: "Anuradha Talur" <ata...@redhat.com>
>>>>>> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
>>>>>> "Krutika Dhananjay" <kdhan...@redhat.com>
>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>> >
>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > > Response inline.
>>>>>> > >
>>>>>> > > - Original Message -
>>>>>> > > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
>>>>>> > > > To: "David Gossage" <dgoss...@carouselchecks.com>
>>>>>> > > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org
>>>>>> >
>>>>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM
>>>>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>> > > >
>>>>>> > > > Could you attach both client and brick logs? Meanwhile I will
>>>>>> try these
>>>>>> > > steps
>>>>>> > > > out on my machines and see if it is easily recreatable.
>>>>>> > > >
>>>>>> > > > -Krutika
>>>>>> > > >
>>>>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>>>>> > > dgoss...@carouselchecks.com
>>>>>> > > > > wrote:
>>>>>> > > >
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > Centos 7 Gluster 3.8.3
>>>>>> > > >
>>>>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>> > > > Options Reconfigured:
>>>>>> > > > cluster.data-self-heal-algorithm: full
>>>>>> > > > cluster.self-heal-daemon: on
>>>>>> > > > cluster.locking-scheme: granular
>>>>>> > > > features.shard-block-size: 64MB
>>>>>> > > > features.shard: on
>>>>>> > > > performance.readdir-ahead: on
>>>>>> > > > storage.owner-uid: 36
>>>>>> > > > storage.owner-gid: 36
>>>>>> > > > performance.quick-read: off
>>>>>> > > > performance.read-ahead: off
>>>>>> > > > performance.io-cache: off
>>>>>> > > > performance.stat-prefetch: on
>>>>>> > > > cluster.eager-lock: enable

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread Krutika Dhananjay
On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <dgoss...@carouselchecks.com>
wrote:

> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <kdhan...@redhat.com>
> wrote:
>
>> Could you also share the glustershd logs?
>>
>
> I'll get them when I get to work sure.
>
>
>>
>> I tried the same steps that you mentioned multiple times, but heal is
>> running to completion without any issues.
>>
>> It must be said that 'heal full' traverses the files and directories in a
>> depth-first order and does heals also in the same order. But if it gets
>> interrupted in the middle (say because self-heal-daemon was either
>> intentionally or unintentionally brought offline and then brought back up),
>> self-heal will only pick up the entries that are so far marked as
>> new-entries that need heal which it will find in indices/xattrop directory.
>> What this means is that those files and directories that were not visited
>> during the crawl, will remain untouched and unhealed in this second
>> iteration of heal, unless you execute a 'heal-full' again.
>>
>
> So should it start healing shards as it crawls or not until after it
> crawls the entire .shard directory?  At the pace it was going that could be
> a week with one node appearing in the cluster but with no shard files if
> anything tries to access a file on that node.  From my experience other day
> telling it to heal full again did nothing regardless of node used.
>
>
>> My suspicion is that this is what happened on your setup. Could you
>> confirm if that was the case?
>>
>
> Brick was brought online with force start then a full heal launched.
> Hours later after it became evident that it was not adding new files to
> heal I did try restarting self-heal daemon and relaunching full heal again.
> But this was after the heal had basically already failed to work as
> intended.
>

OK. How did you figure it was not adding any new files? I need to know what
places you were monitoring to come to this conclusion.

-Krutika


>
>
>> As for those logs, I did manage to do something that caused these
>> warning messages you shared earlier to appear in my client and server logs.
>> Although these logs are annoying and a bit scary too, they didn't do any
>> harm to the data in my volume. Why they appear just after a brick is
>> replaced and under no other circumstances is something I'm still
>> investigating.
>>
>> But for future, it would be good to follow the steps Anuradha gave as
>> that would allow self-heal to at least detect that it has some repairing to
>> do whenever it is restarted whether intentionally or otherwise.
>>
>
> I followed those steps as described on my test box and ended up with exact
> same outcome of adding shards at an agonizing slow pace and no creation of
> .shard directory or heals on shard directory.  Directories visible from
> mount healed quickly.  This was with one VM so it has only 800 shards as
> well.  After hours at work it had added a total of 33 shards to be healed.
> I sent those logs yesterday as well though not the glustershd.
>
> Does replace-brick command copy files in same manner?  For these purposes
> I am contemplating just skipping the heal route.
>
>
>> -Krutika
>>
>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> attached brick and client logs from test machine where same behavior
>>> occurred not sure if anything new is there.  its still on 3.8.2
>>>
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>> Options Reconfigured:
>>> cluster.locking-scheme: granular
>>> performance.strict-o-direct: off
>>> features.shard-block-size: 64MB
>>> features.shard: on
>>> server.allow-insecure: on
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> network.remote-dio: on
>>> cluster.eager-lock: enable
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.quick-read: off
>>> cluster.self-heal-window-size: 1024
>>> cluster.background-self-heal-count: 16
>>> nfs.enable-ino32: off
>>> nfs.addr-namelookup: off
>>> nfs.disable: on
>>> performance.read-ahead: off
>>> performance.readdir-ahead: on
>>> cluster.granular-entry-heal: on
>>&

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Could you also share the glustershd logs?
>

I'll get them when I get to work, sure.


>
> I tried the same steps that you mentioned multiple times, but heal is
> running to completion without any issues.
>
> It must be said that 'heal full' traverses the files and directories in a
> depth-first order and does heals also in the same order. But if it gets
> interrupted in the middle (say because self-heal-daemon was either
> intentionally or unintentionally brought offline and then brought back up),
> self-heal will only pick up the entries that are so far marked as
> new-entries that need heal which it will find in indices/xattrop directory.
> What this means is that those files and directories that were not visited
> during the crawl, will remain untouched and unhealed in this second
> iteration of heal, unless you execute a 'heal-full' again.
>

So should it start healing shards as it crawls, or not until after it crawls
the entire .shard directory?  At the pace it was going that could take a week,
with one node appearing in the cluster but having no shard files if anything
tries to access a file on that node.  From my experience the other day, telling
it to heal full again did nothing, regardless of which node it was run from.
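
(For reference, the kind of thing one can watch from the CLI, with GLUSTER1
standing in for the volume name:

  # entries currently queued for heal, per brick
  gluster volume heal GLUSTER1 statistics heal-count

  # or the full listing, summarised
  gluster volume heal GLUSTER1 info | grep 'Number of entries'
)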


> My suspicion is that this is what happened on your setup. Could you
> confirm if that was the case?
>

The brick was brought online with a force start, then a full heal was
launched.  Hours later, after it became evident that it was not adding new
files to heal, I did try restarting the self-heal daemon and relaunching the
full heal. But this was after the heal had basically already failed to work
as intended.
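
(For what it's worth, a sketch of what I mean by restarting the self-heal
daemon, with GLUSTER1 standing in for the volume name:

  # toggle the self-heal daemon off and back on
  gluster volume set GLUSTER1 cluster.self-heal-daemon off
  gluster volume set GLUSTER1 cluster.self-heal-daemon on

  # 'start ... force' is also commonly used to respawn glustershd
  gluster volume start GLUSTER1 force

  # then relaunch the crawl
  gluster volume heal GLUSTER1 full
)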


> As for those logs, I did manage to do something that caused these warning
> messages you shared earlier to appear in my client and server logs.
> Although these logs are annoying and a bit scary too, they didn't do any
> harm to the data in my volume. Why they appear just after a brick is
> replaced and under no other circumstances is something I'm still
> investigating.
>
> But for future, it would be good to follow the steps Anuradha gave as that
> would allow self-heal to at least detect that it has some repairing to do
> whenever it is restarted whether intentionally or otherwise.
>

I followed those steps as described on my test box and ended up with the exact
same outcome: shards added at an agonizingly slow pace and no creation of the
.shard directory or heals on the shard directory.  Directories visible from the
mount healed quickly.  This was with one VM, so it has only 800 shards as well.
After hours at work it had added a total of 33 shards to be healed.  I sent
those logs yesterday as well, though not the glustershd log.

Does the replace-brick command copy files in the same manner?  For these
purposes I am contemplating just skipping the heal route.
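
(The form of the command I have in mind, if I go that route; the new brick
path below is just a placeholder:

  gluster volume replace-brick GLUSTER1 \
      ccgl2.gl.local:/gluster1/BRICK1/1 \
      ccgl2.gl.local:/gluster1/BRICK1/new \
      commit force

My understanding is that self-heal should then populate the new brick path on
its own, but please correct me if that is not how it behaves.)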


> -Krutika
>
> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> attached brick and client logs from test machine where same behavior
>> occurred not sure if anything new is there.  its still on 3.8.2
>>
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: 192.168.71.10:/gluster2/brick1/1
>> Brick2: 192.168.71.11:/gluster2/brick2/1
>> Brick3: 192.168.71.12:/gluster2/brick3/1
>> Options Reconfigured:
>> cluster.locking-scheme: granular
>> performance.strict-o-direct: off
>> features.shard-block-size: 64MB
>> features.shard: on
>> server.allow-insecure: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> network.remote-dio: on
>> cluster.eager-lock: enable
>> performance.stat-prefetch: off
>> performance.io-cache: off
>> performance.quick-read: off
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> nfs.enable-ino32: off
>> nfs.addr-namelookup: off
>> nfs.disable: on
>> performance.read-ahead: off
>> performance.readdir-ahead: on
>> cluster.granular-entry-heal: on
>>
>>
>>
>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> - Original Message -
>>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>>> > To: "Anuradha Talur" <ata...@redhat.com>
>>>> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
>>>> "Krutika Dhananjay" <kdhan

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread Krutika Dhananjay
Could you also share the glustershd logs?

I tried the same steps that you mentioned multiple times, but heal is
running to completion without any issues.

It must be said that 'heal full' traverses the files and directories in a
depth-first order and does heals also in the same order. But if it gets
interrupted in the middle (say because self-heal-daemon was either
intentionally or unintentionally brought offline and then brought back up),
self-heal will only pick up the entries that are so far marked as
new-entries that need heal which it will find in indices/xattrop directory.
What this means is that those files and directories that were not visited
during the crawl, will remain untouched and unhealed in this second
iteration of heal, unless you execute a 'heal-full' again.
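
(To make that concrete, a rough sketch only, reusing one of the brick paths
from earlier in the thread: the entries the self-heal daemon will pick up on
restart live in a directory like

  # run on a brick that still has the data; roughly one gfid entry per
  # file pending heal, plus a base xattrop-* file
  ls /gluster1/BRICK1/1/.glusterfs/indices/xattrop | wc -l

so watching that count is one way to see what a restarted heal will process.)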

My suspicion is that this is what happened on your setup. Could you confirm
if that was the case?

As for those logs, I did manage to do something that caused these warning
messages you shared earlier to appear in my client and server logs.
Although these logs are annoying and a bit scary too, they didn't do any
harm to the data in my volume. Why they appear just after a brick is
replaced and under no other circumstances is something I'm still
investigating.

But for future, it would be good to follow the steps Anuradha gave as that
would allow self-heal to at least detect that it has some repairing to do
whenever it is restarted whether intentionally or otherwise.

-Krutika

On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <dgoss...@carouselchecks.com>
wrote:

> attached brick and client logs from test machine where same behavior
> occurred not sure if anything new is there.  its still on 3.8.2
>
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.71.10:/gluster2/brick1/1
> Brick2: 192.168.71.11:/gluster2/brick2/1
> Brick3: 192.168.71.12:/gluster2/brick3/1
> Options Reconfigured:
> cluster.locking-scheme: granular
> performance.strict-o-direct: off
> features.shard-block-size: 64MB
> features.shard: on
> server.allow-insecure: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: on
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.quick-read: off
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> nfs.enable-ino32: off
> nfs.addr-namelookup: off
> nfs.disable: on
> performance.read-ahead: off
> performance.readdir-ahead: on
> cluster.granular-entry-heal: on
>
>
>
> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com>
>> wrote:
>>
>>>
>>>
>>> - Original Message -
>>> > From: "David Gossage" <dgoss...@carouselchecks.com>
>>> > To: "Anuradha Talur" <ata...@redhat.com>
>>> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
>>> "Krutika Dhananjay" <kdhan...@redhat.com>
>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>> >
>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com>
>>> wrote:
>>> >
>>> > > Response inline.
>>> > >
>>> > > - Original Message -
>>> > > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
>>> > > > To: "David Gossage" <dgoss...@carouselchecks.com>
>>> > > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM
>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>> > > >
>>> > > > Could you attach both client and brick logs? Meanwhile I will try
>>> these
>>> > > steps
>>> > > > out on my machines and see if it is easily recreatable.
>>> > > >
>>> > > > -Krutika
>>> > > >
>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>> > > dgoss...@carouselchecks.com
>>> > > > > wrote:
>>> > > >
>>> > > >
>>> > > >
>>> > > > Centos 7 Gluster 3.8.3
>>> > > >
>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>> >

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-30 Thread David Gossage
On Mon, Aug 29, 2016 at 11:25 PM, Darrell Budic 
wrote:

> I noticed that my new brick (replacement disk) did not have a .shard
> directory created on the brick, if that helps.
>
> I removed the affected brick from the volume and then wiped the disk, did
> an add-brick, and everything healed right up. I didn't try and set any
> attrs or anything else, just removed and added the brick as new.
>

I was considering just using the replace-brick command instead.  The point of
using heal was basically that I was hoping it would work so I could keep the
brick directory scheme the same as before.  But if the tradeoff is just having
brick directories with slightly different paths while the copy of the files
succeeds, I'd rather have the node back up.

>
> On Aug 29, 2016, at 9:49 AM, Darrell Budic  wrote:
>
> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly, now all the shards are queued up in a
> heal list, but nothing is healing. Got similar brick errors logged to the
> ones David was getting on the brick that isn't healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
> ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I'll
> try the suggested method here on it shortly.
>
>   -Darrell
>
>
> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay 
> wrote:
>
> Got it. Thanks.
>
> I tried the same test and shd crashed with SIGABRT (well, that's because I
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
>>> wrote:
>>>
 Could you attach both client and brick logs? Meanwhile I will try these
 steps out on my machines and see if it is easily recreatable.


>>> Hoping 7z files are accepted by mail server.
>>>
>>
>> looks like zip file awaiting approval due to size
>>
>>>
>>> -Krutika

 On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
 dgoss...@carouselchecks.com> wrote:

> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following steps detailed in previous recommendations began proces of
> replacing and healngbricks one node at a time.
>
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
> 1st node worked as expected took 12 hours to heal 1TB data.  Load was
> little heavy but nothing shocking.
>
> About an hour after node 1 finished I began same process on node2.
> Heal proces kicked in as before and the files in directories visible from
> mount and .glusterfs healed in short time.  Then it began crawl of .shard
> adding those files to heal count at which point the entire proces ground 
> to
> a halt basically.  After 48 hours out of 19k shards it has added 5900 to
> heal list.  Load on all 3 machnes is negligible.   It was suggested to
> change this value to full cluster.data-self-heal-algorithm 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Ignore. I just realised you're on 3.7.14. So then the problem may not be
with the granular entry self-heal feature.

-Krutika

On Tue, Aug 30, 2016 at 10:14 AM, Krutika Dhananjay 
wrote:

> OK. Do you also have granular-entry-heal on - just so that I can isolate
> the problem area.
>
> -Krutika
>
> On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic 
> wrote:
>
>> I noticed that my new brick (replacement disk) did not have a .shard
>> directory created on the brick, if that helps.
>>
>> I removed the affected brick from the volume and then wiped the disk, did
>> an add-brick, and everything healed right up. I didn't try and set any
>> attrs or anything else, just removed and added the brick as new.
>>
>> On Aug 29, 2016, at 9:49 AM, Darrell Budic 
>> wrote:
>>
>> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7.
>> Some content was healed correctly, now all the shards are queued up in a
>> heal list, but nothing is healing. Got similar brick errors logged to the
>> ones David was getting on the brick that isn't healing:
>>
>> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
>> LOOKUP (null) (----
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) ==> (Invalid argument)
>> [Invalid argument]
>> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
>> LOOKUP (null) (----
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) ==> (Invalid argument)
>> [Invalid argument]
>>
>> This was after replacing the drive the brick was on and trying to get it
>> back into the system by setting the volume's fattr on the brick dir. I'll
>> try the suggested method here on it shortly.
>>
>>   -Darrell
>>
>>
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay 
>> wrote:
>>
>> Got it. Thanks.
>>
>> I tried the same test and shd crashed with SIGABRT (well, that's because
>> I compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>>
>>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>>> dgoss...@carouselchecks.com> wrote:
>>>
 On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay  wrote:

> Could you attach both client and brick logs? Meanwhile I will try
> these steps out on my machines and see if it is easily recreatable.
>
>
 Hoping 7z files are accepted by mail server.

>>>
>>> looks like zip file awaiting approval due to size
>>>

 -Krutika
>
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> Centos 7 Gluster 3.8.3
>>
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
>>
>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>> Following steps detailed in previous recommendations began proces of
>> replacing and healngbricks one node at a time.
>>
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
>>
>> 1st node worked as expected took 12 hours to heal 1TB data.  Load was
>> little heavy but nothing shocking.
>>
>> About an hour after node 1 finished I began same process on node2.
>> Heal proces kicked in as before and the files in directories visible from
>> mount and .glusterfs healed in short time.  Then it began crawl of .shard
>> adding those files to heal count at which point the entire proces ground 
>> to
>> a halt basically.  After 48 hours out of 19k shards it has added 5900 to
>> heal list.  Load 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
OK. Do you also have granular-entry-heal on, just so that I can isolate
the problem area?
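
(It should show up under the reconfigured options, e.g., with the volume name
as a placeholder and assuming your version has 'volume get':

  gluster volume get <volname> cluster.granular-entry-heal
  # or just look for cluster.granular-entry-heal in
  gluster volume info <volname>
)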

-Krutika

On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic 
wrote:

> I noticed that my new brick (replacement disk) did not have a .shard
> directory created on the brick, if that helps.
>
> I removed the affected brick from the volume and then wiped the disk, did
> an add-brick, and everything healed right up. I didn't try and set any
> attrs or anything else, just removed and added the brick as new.
>
> On Aug 29, 2016, at 9:49 AM, Darrell Budic  wrote:
>
> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly, now all the shards are queued up in a
> heal list, but nothing is healing. Got similar brick errors logged to the
> ones David was getting on the brick that isn't healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
> ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I'll
> try the suggested method here on it shortly.
>
>   -Darrell
>
>
> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay 
> wrote:
>
> Got it. Thanks.
>
> I tried the same test and shd crashed with SIGABRT (well, that's because I
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
>>> wrote:
>>>
 Could you attach both client and brick logs? Meanwhile I will try these
 steps out on my machines and see if it is easily recreatable.


>>> Hoping 7z files are accepted by mail server.
>>>
>>
>> looks like zip file awaiting approval due to size
>>
>>>
>>> -Krutika

 On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
 dgoss...@carouselchecks.com> wrote:

> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following steps detailed in previous recommendations began proces of
> replacing and healngbricks one node at a time.
>
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
> 1st node worked as expected took 12 hours to heal 1TB data.  Load was
> little heavy but nothing shocking.
>
> About an hour after node 1 finished I began same process on node2.
> Heal proces kicked in as before and the files in directories visible from
> mount and .glusterfs healed in short time.  Then it began crawl of .shard
> adding those files to heal count at which point the entire proces ground 
> to
> a halt basically.  After 48 hours out of 19k shards it has added 5900 to
> heal list.  Load on all 3 machnes is negligible.   It was suggested to
> change this value to full cluster.data-self-heal-algorithm and
> restart volume which I did.  No efffect.  Tried relaunching heal no 
> effect,
> despite any node picked.  I started each VM and performed a stat of all
> files from within it, or a full virus scan  

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Darrell Budic
I noticed that my new brick (replacement disk) did not have a .shard directory 
created on the brick, if that helps. 

I removed the affected brick from the volume and then wiped the disk, did an 
add-brick, and everything healed right up. I didn't try and set any attrs or
anything else, just removed and added the brick as new.
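
Roughly what that looked like, from memory; the host name and brick path are
placeholders and this assumes a 3-way replica:

  # drop the bad brick, temporarily reducing the replica count
  gluster volume remove-brick gv0-rep replica 2 host3:/bricks/gv0/brick1 force

  # wipe and recreate the brick directory, then add it back
  gluster volume add-brick gv0-rep replica 3 host3:/bricks/gv0/brick1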

> On Aug 29, 2016, at 9:49 AM, Darrell Budic  wrote:
> 
> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7. Some
> content was healed correctly, now all the shards are queued up in a heal 
> list, but nothing is healing. Got similar brick errors logged to the ones 
> David was getting on the brick that isn't healing:
> 
> [2016-08-29 03:31:40.436110] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP 
> (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
>  ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP 
> (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
>  ==> (Invalid argument) [Invalid argument]
> 
> This was after replacing the drive the brick was on and trying to get it back 
> into the system by setting the volume's fattr on the brick dir. I'll try the
> suggested method here on it shortly.
> 
>   -Darrell
> 
> 
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay > > wrote:
>> 
>> Got it. Thanks.
>> 
>> I tried the same test and shd crashed with SIGABRT (well, that's because I 
>> compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage > > wrote:
>> 
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage > > wrote:
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay > > wrote:
>> Could you attach both client and brick logs? Meanwhile I will try these 
>> steps out on my machines and see if it is easily recreatable.
>> 
>> 
>> Hoping 7z files are accepted by mail server.
>> 
>> looks like zip file awaiting approval due to size 
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage > > wrote:
>> Centos 7 Gluster 3.8.3
>> 
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
>> 
>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>> Following steps detailed in previous recommendations began proces of 
>> replacing and healngbricks one node at a time.
>> 
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
>> 
>> 1st node worked as expected took 12 hours to heal 1TB data.  Load was little 
>> heavy but nothing shocking.
>> 
>> About an hour after node 1 finished I began same process on node2.  Heal 
>> proces kicked in as before and the files in directories visible from mount 
>> and .glusterfs healed in short time.  Then it began crawl of .shard adding 
>> those files to heal count at which point the entire proces ground to a halt 
>> basically.  After 48 hours out of 19k shards it has added 5900 to heal list. 
>>  Load on all 3 machnes is negligible.   It was suggested to change this 
>> value to full cluster.data-self-heal-algorithm and restart volume which I 
>> did.  No efffect.  Tried relaunching heal no effect, despite any node 
>> picked.  I started each VM and performed a stat of all files from within it, 
>> or a full virus scan  and that seemed to cause short small spikes in shards 
>> added, but not by much.  Logs are showing no real messages indicating 
>> anything is going 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com> wrote:

>
>
> - Original Message -
> > From: "David Gossage" <dgoss...@carouselchecks.com>
> > To: "Anuradha Talur" <ata...@redhat.com>
> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
> "Krutika Dhananjay" <kdhan...@redhat.com>
> > Sent: Monday, August 29, 2016 5:12:42 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com>
> wrote:
> >
> > > Response inline.
> > >
> > > - Original Message -
> > > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
> > > > To: "David Gossage" <dgoss...@carouselchecks.com>
> > > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
> > > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > > >
> > > > Could you attach both client and brick logs? Meanwhile I will try
> these
> > > steps
> > > > out on my machines and see if it is easily recreatable.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > > dgoss...@carouselchecks.com
> > > > > wrote:
> > > >
> > > >
> > > >
> > > > Centos 7 Gluster 3.8.3
> > > >
> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > > Options Reconfigured:
> > > > cluster.data-self-heal-algorithm: full
> > > > cluster.self-heal-daemon: on
> > > > cluster.locking-scheme: granular
> > > > features.shard-block-size: 64MB
> > > > features.shard: on
> > > > performance.readdir-ahead: on
> > > > storage.owner-uid: 36
> > > > storage.owner-gid: 36
> > > > performance.quick-read: off
> > > > performance.read-ahead: off
> > > > performance.io-cache: off
> > > > performance.stat-prefetch: on
> > > > cluster.eager-lock: enable
> > > > network.remote-dio: enable
> > > > cluster.quorum-type: auto
> > > > cluster.server-quorum-type: server
> > > > server.allow-insecure: on
> > > > cluster.self-heal-window-size: 1024
> > > > cluster.background-self-heal-count: 16
> > > > performance.strict-write-ordering: off
> > > > nfs.disable: on
> > > > nfs.addr-namelookup: off
> > > > nfs.enable-ino32: off
> > > > cluster.granular-entry-heal: on
> > > >
> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > > Following steps detailed in previous recommendations began proces of
> > > > replacing and healngbricks one node at a time.
> > > >
> > > > 1) kill pid of brick
> > > > 2) reconfigure brick from raid6 to raid10
> > > > 3) recreate directory of brick
> > > > 4) gluster volume start <> force
> > > > 5) gluster volume heal <> full
> > > Hi,
> > >
> > > I'd suggest that full heal is not used. There are a few bugs in full
> heal.
> > > Better safe than sorry ;)
> > > Instead I'd suggest the following steps:
> > >
> > > Currently I brought the node down by systemctl stop glusterd as I was
> > getting sporadic io issues and a few VM's paused so hoping that will
> help.
> > I may wait to do this till around 4PM when most work is done in case it
> > shoots load up.
> >
> >
> > > 1) kill pid of brick
> > > 2) do whatever reconfiguring of the brick that you need
> > > 3) recreate brick dir
> > > 4) while the brick is still down, from the mount point:
> > >a) create a dummy non existent dir under / of mount.
> > >
> >
> > so if node 2 is the down brick, pick a node, for example 3, and make a test dir
> > under its brick directory that doesn't exist on 2, or should I be doing this
> > over a gluster mount?
> You should be doing this over gluster mount.
> >
> > >b) set a non existent extended attribute on / of mount.
> > >
> >
> > Could you give me an example of an attribute to set?   I've read a tad on
> > this, and looked up attributes but
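
Something like the following is what I'm guessing is meant; the mount path
and the attribute name are made up, so please correct me if they matter:

  # from a client with the volume fuse-mounted at /mnt/vol
  mkdir /mnt/vol/dummy-heal-trigger
  setfattr -n user.dummy-heal-trigger -v 1 /mnt/vol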

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Darrell Budic
Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7. Some
content was healed correctly, now all the shards are queued up in a heal list, 
but nothing is healing. Got similar brick errors logged to the ones David was 
getting on the brick that isn't healing:

[2016-08-29 03:31:40.436110] E [MSGID: 115050] 
[server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP 
(null) 
(----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) 
==> (Invalid argument) [Invalid argument]
[2016-08-29 03:31:43.005013] E [MSGID: 115050] 
[server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP 
(null) 
(----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) 
==> (Invalid argument) [Invalid argument]

This was after replacing the drive the brick was on and trying to get it back 
into the system by setting the volume's fattr on the brick dir. I'll try the
suggested method here on it shortly.

  -Darrell


> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay  wrote:
> 
> Got it. Thanks.
> 
> I tried the same test and shd crashed with SIGABRT (well, that's because I 
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage  > wrote:
> 
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage  > wrote:
> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay  > wrote:
> Could you attach both client and brick logs? Meanwhile I will try these steps 
> out on my machines and see if it is easily recreatable.
> 
> 
> Hoping 7z files are accepted by mail server.
> 
> looks like zip file awaiting approval due to size 
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage  > wrote:
> Centos 7 Gluster 3.8.3
> 
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
> 
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following steps detailed in previous recommendations began proces of 
> replacing and healngbricks one node at a time.
> 
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
> 
> 1st node worked as expected took 12 hours to heal 1TB data.  Load was little 
> heavy but nothing shocking.
> 
> About an hour after node 1 finished I began same process on node2.  Heal 
> proces kicked in as before and the files in directories visible from mount 
> and .glusterfs healed in short time.  Then it began crawl of .shard adding 
> those files to heal count at which point the entire proces ground to a halt 
> basically.  After 48 hours out of 19k shards it has added 5900 to heal list.  
> Load on all 3 machnes is negligible.   It was suggested to change this value 
> to full cluster.data-self-heal-algorithm and restart volume which I did.  No 
> efffect.  Tried relaunching heal no effect, despite any node picked.  I 
> started each VM and performed a stat of all files from within it, or a full 
> virus scan  and that seemed to cause short small spikes in shards added, but 
> not by much.  Logs are showing no real messages indicating anything is going 
> on.  I get hits to brick log on occasion of null lookups making me think its 
> not really crawling shards directory but waiting for a shard lookup to add 
> it.  I'll get following in brick log but not constant and sometime multiple 
> for same shard.
> 
> [2016-08-29 08:31:57.478125] W [MSGID: 115009] 
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type 
> for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050] 
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: LOOKUP 
> (null) (---00
> 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Got it. Thanks.

I tried the same test and shd crashed with SIGABRT (well, that's because I
compiled from src with -DDEBUG).
In any case, this error would prevent full heal from proceeding further.
I'm debugging the crash now. Will let you know when I have the RC.

-Krutika

On Mon, Aug 29, 2016 at 5:47 PM, David Gossage 
wrote:

>
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
>> wrote:
>>
>>> Could you attach both client and brick logs? Meanwhile I will try these
>>> steps out on my machines and see if it is easily recreatable.
>>>
>>>
>> Hoping 7z files are accepted by mail server.
>>
>
> looks like zip file awaiting approval due to size
>
>>
>> -Krutika
>>>
>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>> dgoss...@carouselchecks.com> wrote:
>>>
 Centos 7 Gluster 3.8.3

 Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
 Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
 Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
 Options Reconfigured:
 cluster.data-self-heal-algorithm: full
 cluster.self-heal-daemon: on
 cluster.locking-scheme: granular
 features.shard-block-size: 64MB
 features.shard: on
 performance.readdir-ahead: on
 storage.owner-uid: 36
 storage.owner-gid: 36
 performance.quick-read: off
 performance.read-ahead: off
 performance.io-cache: off
 performance.stat-prefetch: on
 cluster.eager-lock: enable
 network.remote-dio: enable
 cluster.quorum-type: auto
 cluster.server-quorum-type: server
 server.allow-insecure: on
 cluster.self-heal-window-size: 1024
 cluster.background-self-heal-count: 16
 performance.strict-write-ordering: off
 nfs.disable: on
 nfs.addr-namelookup: off
 nfs.enable-ino32: off
 cluster.granular-entry-heal: on

 Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
 Following the steps detailed in previous recommendations, I began the process
 of replacing and healing bricks one node at a time.

 1) kill pid of brick
 2) reconfigure brick from raid6 to raid10
 3) recreate directory of brick
 4) gluster volume start <> force
 5) gluster volume heal <> full

 The 1st node worked as expected and took 12 hours to heal 1TB of data.  Load
 was a little heavy but nothing shocking.

 About an hour after node 1 finished I began the same process on node2.
 The heal process kicked in as before and the files in directories visible
 from the mount and .glusterfs healed in a short time.  Then it began the
 crawl of .shard, adding those files to the heal count, at which point the
 entire process basically ground to a halt.  After 48 hours, out of 19k
 shards it has added 5900 to the heal list.  Load on all 3 machines is
 negligible.  It was suggested to change cluster.data-self-heal-algorithm
 to full and restart the volume, which I did.  No effect.  Tried relaunching
 the heal, no effect, regardless of which node was picked.  I started each VM
 and performed a stat of all files from within it, or a full virus scan, and
 that seemed to cause short small spikes in shards added, but not by much.
 Logs are showing no real messages indicating anything is going on.  I get
 hits to the brick log on occasion of null lookups, making me think it's not
 really crawling the shards directory but waiting for a shard lookup to add
 it.  I'll get the following in the brick log, but not constant, and
 sometimes multiple entries for the same shard.

 [2016-08-29 08:31:57.478125] W [MSGID: 115009]
 [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
 type for (null) (LOOKUP)
 [2016-08-29 08:31:57.478170] E [MSGID: 115050]
 [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
 LOOKUP (null) (---00
 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
 argument) [Invalid argument]

 This one repeated about 30 times in a row, then nothing for 10 minutes, then
 one hit for a different shard by itself.

 How can I determine if the heal is actually running?  How can I kill it or
 force a restart?  Does the node I start it from determine which directory
 gets crawled to determine heals?

 *David Gossage*
 *Carousel Checks Inc. | System Administrator*
 *Office* 708.613.2284

 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-users

>>>
>>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:14 AM, David Gossage 
wrote:

> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
> wrote:
>
>> Could you attach both client and brick logs? Meanwhile I will try these
>> steps out on my machines and see if it is easily recreatable.
>>
>>
> Hoping 7z files are accepted by mail server.
>

Looks like the zip file is awaiting approval due to size.

>
> -Krutika
>>
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> Centos 7 Gluster 3.8.3
>>>
>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> cluster.self-heal-daemon: on
>>> cluster.locking-scheme: granular
>>> features.shard-block-size: 64MB
>>> features.shard: on
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: on
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> server.allow-insecure: on
>>> cluster.self-heal-window-size: 1024
>>> cluster.background-self-heal-count: 16
>>> performance.strict-write-ordering: off
>>> nfs.disable: on
>>> nfs.addr-namelookup: off
>>> nfs.enable-ino32: off
>>> cluster.granular-entry-heal: on
>>>
>>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>>> Following steps detailed in previous recommendations began proces of
>>> replacing and healngbricks one node at a time.
>>>
>>> 1) kill pid of brick
>>> 2) reconfigure brick from raid6 to raid10
>>> 3) recreate directory of brick
>>> 4) gluster volume start <> force
>>> 5) gluster volume heal <> full
>>>
>>> 1st node worked as expected took 12 hours to heal 1TB data.  Load was
>>> little heavy but nothing shocking.
>>>
>>> About an hour after node 1 finished I began same process on node2.  Heal
>>> proces kicked in as before and the files in directories visible from mount
>>> and .glusterfs healed in short time.  Then it began crawl of .shard adding
>>> those files to heal count at which point the entire proces ground to a halt
>>> basically.  After 48 hours out of 19k shards it has added 5900 to heal
>>> list.  Load on all 3 machnes is negligible.   It was suggested to change
>>> this value to full cluster.data-self-heal-algorithm and restart volume
>>> which I did.  No efffect.  Tried relaunching heal no effect, despite any
>>> node picked.  I started each VM and performed a stat of all files from
>>> within it, or a full virus scan  and that seemed to cause short small
>>> spikes in shards added, but not by much.  Logs are showing no real messages
>>> indicating anything is going on.  I get hits to brick log on occasion of
>>> null lookups making me think its not really crawling shards directory but
>>> waiting for a shard lookup to add it.  I'll get following in brick log but
>>> not constant and sometime multiple for same shard.
>>>
>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
>>> type for (null) (LOOKUP)
>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
>>> LOOKUP (null) (---00
>>> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
>>> argument) [Invalid argument]
>>>
>>> This one repeated about 30 times in row then nothing for 10 minutes then
>>> one hit for one different shard by itself.
>>>
>>> How can I determine if Heal is actually running?  How can I kill it or
>>> force restart?  Does node I start it from determine which directory gets
>>> crawled to determine heals?
>>>
>>> *David Gossage*
>>> *Carousel Checks Inc. | System Administrator*
>>> *Office* 708.613.2284
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:14 AM, David Gossage 
wrote:

> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
> wrote:
>
>> Could you attach both client and brick logs? Meanwhile I will try these
>> steps out on my machines and see if it is easily recreatable.
>>
>>
> Hoping 7z files are accepted by mail server.
>

Also, I didn't do a translation of timezones, but in CST I started the node 1
heal at 2016-08-26 20:26:42, and then the next morning I started the initial
node 2 heal at 2016-08-27 07:58:34.

>
> -Krutika
>>
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> Centos 7 Gluster 3.8.3
>>>
>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> cluster.self-heal-daemon: on
>>> cluster.locking-scheme: granular
>>> features.shard-block-size: 64MB
>>> features.shard: on
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: on
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> server.allow-insecure: on
>>> cluster.self-heal-window-size: 1024
>>> cluster.background-self-heal-count: 16
>>> performance.strict-write-ordering: off
>>> nfs.disable: on
>>> nfs.addr-namelookup: off
>>> nfs.enable-ino32: off
>>> cluster.granular-entry-heal: on
>>>
>>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>>> Following steps detailed in previous recommendations began proces of
>>> replacing and healngbricks one node at a time.
>>>
>>> 1) kill pid of brick
>>> 2) reconfigure brick from raid6 to raid10
>>> 3) recreate directory of brick
>>> 4) gluster volume start <> force
>>> 5) gluster volume heal <> full
>>>
>>> 1st node worked as expected took 12 hours to heal 1TB data.  Load was
>>> little heavy but nothing shocking.
>>>
>>> About an hour after node 1 finished I began same process on node2.  Heal
>>> proces kicked in as before and the files in directories visible from mount
>>> and .glusterfs healed in short time.  Then it began crawl of .shard adding
>>> those files to heal count at which point the entire proces ground to a halt
>>> basically.  After 48 hours out of 19k shards it has added 5900 to heal
>>> list.  Load on all 3 machnes is negligible.   It was suggested to change
>>> this value to full cluster.data-self-heal-algorithm and restart volume
>>> which I did.  No efffect.  Tried relaunching heal no effect, despite any
>>> node picked.  I started each VM and performed a stat of all files from
>>> within it, or a full virus scan  and that seemed to cause short small
>>> spikes in shards added, but not by much.  Logs are showing no real messages
>>> indicating anything is going on.  I get hits to brick log on occasion of
>>> null lookups making me think its not really crawling shards directory but
>>> waiting for a shard lookup to add it.  I'll get following in brick log but
>>> not constant and sometime multiple for same shard.
>>>
>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
>>> type for (null) (LOOKUP)
>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
>>> LOOKUP (null) (---00
>>> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
>>> argument) [Invalid argument]
>>>
>>> This one repeated about 30 times in row then nothing for 10 minutes then
>>> one hit for one different shard by itself.
>>>
>>> How can I determine if Heal is actually running?  How can I kill it or
>>> force restart?  Does node I start it from determine which directory gets
>>> crawled to determine heals?
>>>
>>> *David Gossage*
>>> *Carousel Checks Inc. | System Administrator*
>>> *Office* 708.613.2284
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <ata...@redhat.com> wrote:

>
>
> - Original Message -
> > From: "David Gossage" <dgoss...@carouselchecks.com>
> > To: "Anuradha Talur" <ata...@redhat.com>
> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>,
> "Krutika Dhananjay" <kdhan...@redhat.com>
> > Sent: Monday, August 29, 2016 5:12:42 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com>
> wrote:
> >
> > > Response inline.
> > >
> > > - Original Message -
> > > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
> > > > To: "David Gossage" <dgoss...@carouselchecks.com>
> > > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
> > > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > > >
> > > > Could you attach both client and brick logs? Meanwhile I will try
> these
> > > steps
> > > > out on my machines and see if it is easily recreatable.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > > dgoss...@carouselchecks.com
> > > > > wrote:
> > > >
> > > >
> > > >
> > > > Centos 7 Gluster 3.8.3
> > > >
> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > > Options Reconfigured:
> > > > cluster.data-self-heal-algorithm: full
> > > > cluster.self-heal-daemon: on
> > > > cluster.locking-scheme: granular
> > > > features.shard-block-size: 64MB
> > > > features.shard: on
> > > > performance.readdir-ahead: on
> > > > storage.owner-uid: 36
> > > > storage.owner-gid: 36
> > > > performance.quick-read: off
> > > > performance.read-ahead: off
> > > > performance.io-cache: off
> > > > performance.stat-prefetch: on
> > > > cluster.eager-lock: enable
> > > > network.remote-dio: enable
> > > > cluster.quorum-type: auto
> > > > cluster.server-quorum-type: server
> > > > server.allow-insecure: on
> > > > cluster.self-heal-window-size: 1024
> > > > cluster.background-self-heal-count: 16
> > > > performance.strict-write-ordering: off
> > > > nfs.disable: on
> > > > nfs.addr-namelookup: off
> > > > nfs.enable-ino32: off
> > > > cluster.granular-entry-heal: on
> > > >
> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > > Following steps detailed in previous recommendations, I began the process of
> > > > replacing and healing bricks one node at a time.
> > > >
> > > > 1) kill pid of brick
> > > > 2) reconfigure brick from raid6 to raid10
> > > > 3) recreate directory of brick
> > > > 4) gluster volume start <> force
> > > > 5) gluster volume heal <> full
> > > Hi,
> > >
> > > I'd suggest that full heal is not used. There are a few bugs in full
> heal.
> > > Better safe than sorry ;)
> > > Instead I'd suggest the following steps:
> > >
> > > Currently I brought the node down by systemctl stop glusterd as I was
> > getting sporadic io issues and a few VM's paused so hoping that will
> help.
> > I may wait to do this till around 4PM when most work is done in case it
> > shoots load up.
> >
> >
> > > 1) kill pid of brick
> > > 2) do the reconfiguring of the brick that you need
> > > 3) recreate brick dir
> > > 4) while the brick is still down, from the mount point:
> > >a) create a dummy non existent dir under / of mount.
> > >
> >
> > so if node 2 is the down brick, pick another node, for example 3, and make a test dir
> > under its brick directory that doesn't exist on 2, or should I be doing this
> > over a gluster mount?
> You should be doing this over gluster mount.
> >
> > >b) set a non existent extended attribute on / of mount.
> > >
> >
> > Could you give me an example of an attribute to set?   I've read a tad on
> > this, and looked up attributes but haven't set any yet myself.

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Anuradha Talur


- Original Message -
> From: "David Gossage" <dgoss...@carouselchecks.com>
> To: "Anuradha Talur" <ata...@redhat.com>
> Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>, "Krutika 
> Dhananjay" <kdhan...@redhat.com>
> Sent: Monday, August 29, 2016 5:12:42 PM
> Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> 
> On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com> wrote:
> 
> > Response inline.
> >
> > - Original Message -
> > > From: "Krutika Dhananjay" <kdhan...@redhat.com>
> > > To: "David Gossage" <dgoss...@carouselchecks.com>
> > > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
> > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > >
> > > Could you attach both client and brick logs? Meanwhile I will try these
> > steps
> > > out on my machines and see if it is easily recreatable.
> > >
> > > -Krutika
> > >
> > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > dgoss...@carouselchecks.com
> > > > wrote:
> > >
> > >
> > >
> > > Centos 7 Gluster 3.8.3
> > >
> > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > Options Reconfigured:
> > > cluster.data-self-heal-algorithm: full
> > > cluster.self-heal-daemon: on
> > > cluster.locking-scheme: granular
> > > features.shard-block-size: 64MB
> > > features.shard: on
> > > performance.readdir-ahead: on
> > > storage.owner-uid: 36
> > > storage.owner-gid: 36
> > > performance.quick-read: off
> > > performance.read-ahead: off
> > > performance.io-cache: off
> > > performance.stat-prefetch: on
> > > cluster.eager-lock: enable
> > > network.remote-dio: enable
> > > cluster.quorum-type: auto
> > > cluster.server-quorum-type: server
> > > server.allow-insecure: on
> > > cluster.self-heal-window-size: 1024
> > > cluster.background-self-heal-count: 16
> > > performance.strict-write-ordering: off
> > > nfs.disable: on
> > > nfs.addr-namelookup: off
> > > nfs.enable-ino32: off
> > > cluster.granular-entry-heal: on
> > >
> > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > Following steps detailed in previous recommendations, I began the process of
> > > replacing and healing bricks one node at a time.
> > >
> > > 1) kill pid of brick
> > > 2) reconfigure brick from raid6 to raid10
> > > 3) recreate directory of brick
> > > 4) gluster volume start <> force
> > > 5) gluster volume heal <> full
> > Hi,
> >
> > I'd suggest that full heal is not used. There are a few bugs in full heal.
> > Better safe than sorry ;)
> > Instead I'd suggest the following steps:
> >
> > Currently I brought the node down by systemctl stop glusterd as I was
> getting sporadic io issues and a few VM's paused so hoping that will help.
> I may wait to do this till around 4PM when most work is done in case it
> shoots load up.
> 
> 
> > 1) kill pid of brick
> > 2) do the reconfiguring of the brick that you need
> > 3) recreate brick dir
> > 4) while the brick is still down, from the mount point:
> >a) create a dummy non existent dir under / of mount.
> >
> 
> so if node 2 is the down brick, pick another node, for example 3, and make a test dir
> under its brick directory that doesn't exist on 2, or should I be doing this
> over a gluster mount?
You should be doing this over gluster mount.
> 
> >b) set a non existent extended attribute on / of mount.
> >
> 
> Could you give me an example of an attribute to set?   I've read a tad on
> this, and looked up attributes but haven't set any yet myself.
> 
Sure. setfattr -n "user.some-name" -v "some-value" 
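
Put together, steps 4a and 4b over a temporary FUSE mount would look roughly like
the following. This is only a sketch: the server name, volume name, mount path and
the dummy directory/attribute names are placeholders, not values from this thread.

    # mount the volume somewhere temporary
    mount -t glusterfs <any-up-server>:/<volname> /mnt/healtest

    # 4a) create a dummy, previously non-existent directory under / of the mount
    mkdir /mnt/healtest/dummy-heal-marker

    # 4b) set a previously non-existent extended attribute on / of the mount
    setfattr -n "user.dummy-heal-marker" -v "1" /mnt/healtest

    # optional clean-up once the heal has finished
    setfattr -x "user.dummy-heal-marker" /mnt/healtest
    rmdir /mnt/healtest/dummy-heal-marker
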
> Doing these steps will ensure that heal happens only from updated brick to
> > down brick.
> > 5) gluster v start <> force
> > 6) gluster v heal <>
> >
> 
> Will it matter if somewhere in gluster the full heal command was run the other
> day?  Not sure if it eventually stops or times out.
> 
full heal will stop once the crawl is done. So if you want to trigger heal again,
run gluster v heal <>. Actually even brick up or volume start

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <ata...@redhat.com> wrote:

> Response inline.
>
> - Original Message -
> > From: "Krutika Dhananjay" <kdhan...@redhat.com>
> > To: "David Gossage" <dgoss...@carouselchecks.com>
> > Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
> > Sent: Monday, August 29, 2016 3:55:04 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > Could you attach both client and brick logs? Meanwhile I will try these
> steps
> > out on my machines and see if it is easily recreatable.
> >
> > -Krutika
> >
> > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> dgoss...@carouselchecks.com
> > > wrote:
> >
> >
> >
> > Centos 7 Gluster 3.8.3
> >
> > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > Options Reconfigured:
> > cluster.data-self-heal-algorithm: full
> > cluster.self-heal-daemon: on
> > cluster.locking-scheme: granular
> > features.shard-block-size: 64MB
> > features.shard: on
> > performance.readdir-ahead: on
> > storage.owner-uid: 36
> > storage.owner-gid: 36
> > performance.quick-read: off
> > performance.read-ahead: off
> > performance.io-cache: off
> > performance.stat-prefetch: on
> > cluster.eager-lock: enable
> > network.remote-dio: enable
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > server.allow-insecure: on
> > cluster.self-heal-window-size: 1024
> > cluster.background-self-heal-count: 16
> > performance.strict-write-ordering: off
> > nfs.disable: on
> > nfs.addr-namelookup: off
> > nfs.enable-ino32: off
> > cluster.granular-entry-heal: on
> >
> > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > Following steps detailed in previous recommendations, I began the process of
> > replacing and healing bricks one node at a time.
> >
> > 1) kill pid of brick
> > 2) reconfigure brick from raid6 to raid10
> > 3) recreate directory of brick
> > 4) gluster volume start <> force
> > 5) gluster volume heal <> full
> Hi,
>
> I'd suggest that full heal is not used. There are a few bugs in full heal.
> Better safe than sorry ;)
> Instead I'd suggest the following steps:
>
> Currently I brought the node down by systemctl stop glusterd as I was
getting sporadic io issues and a few VM's paused so hoping that will help.
I may wait to do this till around 4PM when most work is done in case it
shoots load up.


> 1) kill pid of brick
> 2) do the reconfiguring of the brick that you need
> 3) recreate brick dir
> 4) while the brick is still down, from the mount point:
>a) create a dummy non existent dir under / of mount.
>

so if node 2 is the down brick, pick another node, for example 3, and make a test dir
under its brick directory that doesn't exist on 2, or should I be doing this
over a gluster mount?

>b) set a non existent extended attribute on / of mount.
>

Could you give me an example of an attribute to set?   I've read a tad on
this, and looked up attributes but haven't set any yet myself.
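
(For reference, existing extended attributes can be listed with getfattr before
setting a new one; a sketch, with the mount path as a placeholder and the brick
path taken from earlier in this thread:)

    # user-namespace xattrs visible on the mount root
    getfattr -d /mnt/<volname>

    # all xattrs, including the trusted.* ones gluster maintains, read directly
    # on a brick path as root on a storage node
    getfattr -d -m . -e hex /gluster1/BRICK1/1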

Doing these steps will ensure that heal happens only from updated brick to
> down brick.
> 5) gluster v start <> force
> 6) gluster v heal <>
>

Will it matter if somewhere in gluster the full heal command was run the other
day?  Not sure if it eventually stops or times out.
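
One way to tell whether that earlier crawl is still running is the heal statistics
output; a sketch, with VOLNAME as a placeholder:

    gluster volume heal VOLNAME statistics
    # per brick, shows the type of each crawl (INDEX or FULL), when it started,
    # and an end time once it has finished

    gluster volume heal VOLNAME statistics heal-count
    # just the number of entries currently pending heal on each brick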

>
> > 1st node worked as expected and took 12 hours to heal 1TB of data. Load was a
> > little heavy but nothing shocking.
> >
> > About an hour after node 1 finished I began the same process on node2. The heal
> > process kicked in as before, and the files in directories visible from the mount
> > and .glusterfs healed in a short time. Then it began the crawl of .shard, adding
> > those files to the heal count, at which point the entire process basically
> > ground to a halt. After 48 hours, out of 19k shards it has added 5900 to the
> > heal list. Load on all 3 machines is negligible. It was suggested to change
> > cluster.data-self-heal-algorithm to full and restart the volume, which I did.
> > No effect. Tried relaunching the heal, no effect, regardless of which node I
> > picked. I started each VM and performed a stat of all files from within it, or
> > a full virus scan, and that seemed to cause short small spikes in shards added,
> > but not by much. Logs are showing no real messages indicating anything is going
> > on. I get occasional hits in the brick log for null lookups, making me think it

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Anuradha Talur
Response inline.

- Original Message -
> From: "Krutika Dhananjay" <kdhan...@redhat.com>
> To: "David Gossage" <dgoss...@carouselchecks.com>
> Cc: "gluster-users@gluster.org List" <Gluster-users@gluster.org>
> Sent: Monday, August 29, 2016 3:55:04 PM
> Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> 
> Could you attach both client and brick logs? Meanwhile I will try these steps
> out on my machines and see if it is easily recreatable.
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage < dgoss...@carouselchecks.com
> > wrote:
> 
> 
> 
> Centos 7 Gluster 3.8.3
> 
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
> 
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following steps detailed in previous recommendations, I began the process of
> replacing and healing bricks one node at a time.
> 
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
Hi,

I'd suggest that full heal is not used. There are a few bugs in full heal.
Better safe than sorry ;)
Instead I'd suggest the following steps:

1) kill pid of brick
2) do the reconfiguring of the brick that you need
3) recreate brick dir
4) while the brick is still down, from the mount point:
   a) create a dummy non existent dir under / of mount.
   b) set a non existent extended attribute on / of mount.
Doing these steps will ensure that heal happens only from updated brick to down 
brick.
5) gluster v start <> force
6) gluster v heal <>
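
Roughly, the above as shell commands; a sketch only, with the volume name, mount
point and dummy names as placeholders (the brick path is the one quoted in this
thread):

    # 1) find and kill the brick process being replaced
    gluster volume status VOLNAME        # note the PID of that brick
    kill <pid-of-that-brick>

    # 2) / 3) rebuild the underlying storage and recreate the empty brick dir
    mkdir -p /gluster1/BRICK1/1

    # 4) while the brick is still down, from a client mount of the volume
    mkdir /mnt/VOLNAME/dummy-nonexistent-dir
    setfattr -n "user.dummy-heal-marker" -v "1" /mnt/VOLNAME

    # 5) / 6) bring the brick back and trigger an index heal (not 'full')
    gluster volume start VOLNAME force
    gluster volume heal VOLNAME
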
> 
> 1st node worked as expected and took 12 hours to heal 1TB of data. Load was a
> little heavy but nothing shocking.
> 
> About an hour after node 1 finished I began the same process on node2. The heal
> process kicked in as before, and the files in directories visible from the mount
> and .glusterfs healed in a short time. Then it began the crawl of .shard, adding
> those files to the heal count, at which point the entire process basically ground
> to a halt. After 48 hours, out of 19k shards it has added 5900 to the heal list.
> Load on all 3 machines is negligible. It was suggested to change
> cluster.data-self-heal-algorithm to full and restart the volume, which I did. No
> effect. Tried relaunching the heal, no effect, regardless of which node I picked.
> I started each VM and performed a stat of all files from within it, or a full
> virus scan, and that seemed to cause short small spikes in shards added, but not
> by much. Logs are showing no real messages indicating anything is going on. I get
> occasional hits in the brick log for null lookups, making me think it's not
> really crawling the .shard directory but waiting for a shard lookup to add it.
> I'll get the following in the brick log, but not constantly, and sometimes
> multiple entries for the same shard.
> 
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type
> for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> LOOKUP (null) (---00
> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
> argument) [Invalid argument]
> 
> This one repeated about 30 times in row then nothing for 10 minutes then one
> hit for one different shard by itself.
> 
> How can I determine if Heal is actually running? How can I kill it or force
> restart? Does the node I start it from determine which directory gets crawled to
> determine heals?
> 
> David Gossage
> Carousel Checks Inc. | System Administrator
> Office 708.613.2284
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
> 
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Thanks,
Anuradha.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Could you attach both client and brick logs? Meanwhile I will try these
steps out on my machines and see if it is easily recreatable.

-Krutika

On Mon, Aug 29, 2016 at 2:31 PM, David Gossage 
wrote:

> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following steps detailed in previous recommendations, I began the process of
> replacing and healing bricks one node at a time.
>
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
> 1st node worked as expected and took 12 hours to heal 1TB of data.  Load was
> a little heavy but nothing shocking.
>
> About an hour after node 1 finished I began the same process on node2.  The heal
> process kicked in as before, and the files in directories visible from the mount
> and .glusterfs healed in a short time.  Then it began the crawl of .shard, adding
> those files to the heal count, at which point the entire process basically ground
> to a halt.  After 48 hours, out of 19k shards it has added 5900 to the heal
> list.  Load on all 3 machines is negligible.  It was suggested to change
> cluster.data-self-heal-algorithm to full and restart the volume, which I did.
> No effect.  Tried relaunching the heal, no effect, regardless of which node I
> picked.  I started each VM and performed a stat of all files from within it,
> or a full virus scan, and that seemed to cause short small spikes in shards
> added, but not by much.  Logs are showing no real messages indicating anything
> is going on.  I get occasional hits in the brick log for null lookups, making
> me think it's not really crawling the .shard directory but waiting for a shard
> lookup to add it.  I'll get the following in the brick log, but not constantly,
> and sometimes multiple entries for the same shard.
>
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> type for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> LOOKUP (null) (---00
> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
> argument) [Invalid argument]
>
> This one repeated about 30 times in row then nothing for 10 minutes then
> one hit for one different shard by itself.
>
> How can I determine if Heal is actually running?  How can I kill it or
> force restart?  Does the node I start it from determine which directory gets
> crawled to determine heals?
>
> *David Gossage*
> *Carousel Checks Inc. | System Administrator*
> *Office* 708.613.2284
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
Centos 7 Gluster 3.8.3

Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
Options Reconfigured:
cluster.data-self-heal-algorithm: full
cluster.self-heal-daemon: on
cluster.locking-scheme: granular
features.shard-block-size: 64MB
features.shard: on
performance.readdir-ahead: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
cluster.self-heal-window-size: 1024
cluster.background-self-heal-count: 16
performance.strict-write-ordering: off
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
cluster.granular-entry-heal: on

Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
Following steps detailed in previous recommendations, I began the process of
replacing and healing bricks one node at a time.

1) kill pid of brick
2) reconfigure brick from raid6 to raid10
3) recreate directory of brick
4) gluster volume start <> force
5) gluster volume heal <> full
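
For concreteness, the same steps as commands; a sketch that assumes the volume is
the GLUSTER1 named in the brick logs below (otherwise treat it as a placeholder):

    gluster volume status GLUSTER1       # note the PID of the brick being replaced
    kill <pid-of-that-brick>
    # ... rebuild RAID and recreate /gluster1/BRICK1/1 ...
    gluster volume start GLUSTER1 force
    gluster volume heal GLUSTER1 full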

1st node worked as expected and took 12 hours to heal 1TB of data.  Load was
a little heavy but nothing shocking.

About an hour after node 1 finished I began the same process on node2.  The heal
process kicked in as before, and the files in directories visible from the mount
and .glusterfs healed in a short time.  Then it began the crawl of .shard, adding
those files to the heal count, at which point the entire process basically ground
to a halt.  After 48 hours, out of 19k shards it has added 5900 to the heal
list.  Load on all 3 machines is negligible.  It was suggested to change
cluster.data-self-heal-algorithm to full and restart the volume, which I did.
No effect.  Tried relaunching the heal, no effect, regardless of which node I
picked.  I started each VM and performed a stat of all files from within it,
or a full virus scan, and that seemed to cause short small spikes in shards
added, but not by much.  Logs are showing no real messages indicating anything
is going on.  I get occasional hits in the brick log for null lookups, making
me think it's not really crawling the .shard directory but waiting for a shard
lookup to add it.  I'll get the following in the brick log, but not constantly,
and sometimes multiple entries for the same shard.

[2016-08-29 08:31:57.478125] W [MSGID: 115009]
[server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type
for (null) (LOOKUP)
[2016-08-29 08:31:57.478170] E [MSGID: 115050]
[server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
LOOKUP (null) (---00
00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
argument) [Invalid argument]

This one repeated about 30 times in row then nothing for 10 minutes then
one hit for one different shard by itself.

How can I determine if Heal is actually running?  How can I kill it or
force restart?  Does the node I start it from determine which directory gets
crawled to determine heals?
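
For the first two questions, a few commands can help; a sketch, with VOLNAME as a
placeholder:

    gluster volume status VOLNAME            # shows whether the Self-heal Daemon is online on each node
    gluster volume heal VOLNAME info         # entries still pending heal, per brick
    gluster volume heal VOLNAME statistics   # crawl type, start/end times and counts, per brick

    # one commonly suggested way to bounce the self-heal daemons without touching the bricks
    gluster volume set VOLNAME cluster.self-heal-daemon off
    gluster volume set VOLNAME cluster.self-heal-daemon on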

*David Gossage*
*Carousel Checks Inc. | System Administrator*
*Office* 708.613.2284
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users