Re: [Gluster-users] [ovirt-users] Re: VM disk corruption with LSM on Gluster

2019-03-26 Thread Krutika Dhananjay
Could you enable strict-o-direct and disable remote-dio on the src volume
as well, restart the VMs on "old" and retry migration?

# gluster volume set <volname> performance.strict-o-direct on
# gluster volume set <volname> network.remote-dio off

-Krutika

On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen  wrote:

> On 26-03-19 14:23, Sahina Bose wrote:
> > +Krutika Dhananjay and gluster ml
> >
> > On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen 
> wrote:
> >> Hello,
> >>
> >> tl;dr We have disk corruption when doing live storage migration on oVirt
> >> 4.2 with gluster 3.12.15. Any idea why?
> >>
> >> We have a 3-node oVirt cluster that is both compute and gluster-storage.
> >> The manager runs on separate hardware. We are running out of space on
> >> this volume, so we added another Gluster volume that is bigger, put a
> >> storage domain on it, and then we migrated VMs to it with LSM. After
> >> some time, we noticed that (some of) the migrated VMs had corrupted
> >> filesystems. After moving everything back with export-import to the old
> >> domain where possible, and recovering from backups where needed, we set
> >> off to investigate this issue.
> >>
> >> We are now at the point where we can reproduce this issue within a day.
> >> What we have found so far:
> >> 1) The corruption occurs at the very end of the replication step, most
> >> probably between START and FINISH of diskReplicateFinish, before the
> >> START merge step.
> >> 2) In the corrupted VM, at some place where data should be, the data is
> >> replaced by zeros. This can be file contents, a directory structure, or
> >> anything else.
> >> 3) The source gluster volume has different settings than the destination
> >> (mostly because the defaults were different at creation time):
> >>
> >> Setting                       old (src)   new (dst)
> >> cluster.op-version            30800       30800 (the same)
> >> cluster.max-op-version        31202       31202 (the same)
> >> cluster.metadata-self-heal    off         on
> >> cluster.data-self-heal        off         on
> >> cluster.entry-self-heal       off         on
> >> performance.low-prio-threads  16          32
> >> performance.strict-o-direct   off         on
> >> network.ping-timeout          42          30
> >> network.remote-dio            enable      off
> >> transport.address-family      -           inet
> >> performance.stat-prefetch     off         on
> >> features.shard-block-size     512MB       64MB
> >> cluster.shd-max-threads       1           8
> >> cluster.shd-wait-qlength      1024        1
> >> cluster.locking-scheme        full        granular
> >> cluster.granular-entry-heal   no          enable
> >>
> >> 4) To test, we migrate some VMs back and forth. The corruption does not
> >> occur every time. So far it has only occurred from old to new, but we
> >> don't have enough data points to be sure about that.
> >>
> >> Does anybody have an idea what is causing the corruption? Is this the
> >> best list to ask, or should I ask on a Gluster list? I am not sure if
> >> this is oVirt specific or Gluster specific though.
> > Do you have logs from old and new gluster volumes? Any errors in the
> > new volume's fuse mount logs?
>
> Around the time of corruption I see the message:
> The message "I [MSGID: 133017] [shard.c:4941:shard_seek] 0-ZoneA_Gluster1-shard: seek called on 7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not supported]" repeated 231 times between [2019-03-26 13:14:22.297333] and [2019-03-26 13:15:42.912170]
>
> I also see this message at other times, when I don't see the corruption
> occur, though.
>
> --
> Sander
> ___
> Users mailing list -- us...@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/us...@ovirt.org/message/M3T2VGGGV6DE643ZKKJUAF274VSWTJFH/
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Prioritise local bricks for IO?

2019-03-26 Thread Vlad Kopylov
I don't remember if it still works, but have a look at NUFA:
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/nufa.md

v

On Tue, Mar 26, 2019 at 7:27 AM Nux!  wrote:

> Hello,
>
> I'm trying to set up a distributed backup storage (no replicas), but I'd
> like to prioritise the local bricks for any IO done on the volume.
> This will be a backup store, so in other words, I'd like the files to be
> written locally if there is space, so as to save the NICs for other traffic.
>
> Does anyone know how this might be achievable, if at all?
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-26 Thread Raghavendra Gowdappa
All,

Glusterfs cleans up POSIX locks held on an fd when the client/mount through
which those locks are held disconnects from the bricks/server. This helps
Glusterfs avoid a stale-lock problem later (for example, if the application
unlocks while the connection is still down). However, it also means the lock
is no longer exclusive, as other applications/clients can acquire the same
lock. To communicate that the locks are no longer valid, we are planning to
mark the fd (which has POSIX locks) bad on a disconnect, so that any future
operations on that fd will fail, forcing the application to re-open the fd
and re-acquire the locks it needs [1].

Note that with AFR/replicate in the picture we can prevent errors to the
application as long as a quorum of children have "never ever" lost their
connection to the bricks after the locks were acquired. I use the term
"never ever" because locks are not healed back after re-connection; the
first disconnect marks the fd bad, and it remains bad even after
re-connection. So it is not just a quorum of children "currently online",
but a quorum of children that have "never disconnected from the bricks
after the locks were acquired".

However, this does not affect you if your application doesn't acquire any
POSIX locks. So, I am interested in knowing:
* Do your use cases use POSIX locks?
* Is it feasible for your application to re-open fds and re-acquire locks
on seeing EBADFD errors?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
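
For anyone wondering what the re-open and re-acquire pattern above would look
like from the application side, here is a minimal sketch in Python. It is an
illustration only, not Gluster code: the lock-file path, the retry count, and
the assumption that the failure surfaces to the application as EBADF/EBADFD
are all mine.

```
import errno
import fcntl
import os

LOCK_PATH = "/mnt/glustervol/app.lock"  # hypothetical file on a Gluster FUSE mount


def open_and_lock(path):
    """Open the file and take an exclusive POSIX (fcntl) lock on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the whole-file lock is granted
    return fd


def write_with_relock(fd, path, data, retries=3):
    """Write under the lock; if the fd went bad after a disconnect,
    re-open it and re-acquire the lock instead of assuming it is still held."""
    for _ in range(retries):
        try:
            os.write(fd, data)
            return fd
        except OSError as err:
            if err.errno not in (errno.EBADF, errno.EBADFD):
                raise
            try:
                os.close(fd)
            except OSError:
                pass  # the old fd is already unusable
            fd = open_and_lock(path)
    raise RuntimeError("could not re-acquire the lock after repeated disconnects")


if __name__ == "__main__":
    fd = open_and_lock(LOCK_PATH)
    fd = write_with_relock(fd, LOCK_PATH, b"heartbeat\n")
    os.close(fd)
```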

regards,
Raghavendra
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] recovery from reboot time?

2019-03-26 Thread Alvin Starr

I tracked down the 2 gfids and it looks like they were "partly?" configured.

I copied the data off the gluster volume they existed on, and then
removed the files on the server and recreated them on the client.


Things seem to be sane again, but at this point I am not amazingly 
confident in the consistency of the filesystem.


I will try running a bit-rot scan against the system to see if there are 
any errors.



On 3/26/19 11:45 AM, Sankarshan Mukhopadhyay wrote:

On Tue, Mar 26, 2019 at 6:10 PM Alvin Starr  wrote:

After almost a week of doing nothing, the brick failed and we were able to
stop and restart glusterd, and then could start a manual heal.

It was interesting: when the heal started, the estimated time to completion
was about 21 days, but as it worked through the 30-some entries it got
faster, to the point where it completed in 2 days.

Now I have 2 gfids that refuse to heal.


Do you need help from the developers on that topic?
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


--
Alvin Starr   ||   land:  (905)513-7688
Netvel Inc.   ||   Cell:  (416)806-0133
al...@netvel.net  ||

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] [ovirt-users] Re: VM disk corruption with LSM on Gluster

2019-03-26 Thread Sander Hoentjen
On 26-03-19 14:23, Sahina Bose wrote:
> +Krutika Dhananjay and gluster ml
>
> On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen  wrote:
>> Hello,
>>
>> tl;dr We have disk corruption when doing live storage migration on oVirt
>> 4.2 with gluster 3.12.15. Any idea why?
>>
>> We have a 3-node oVirt cluster that is both compute and gluster-storage.
>> The manager runs on separate hardware. We are running out of space on
>> this volume, so we added another Gluster volume that is bigger, put a
>> storage domain on it, and then we migrated VMs to it with LSM. After
>> some time, we noticed that (some of) the migrated VMs had corrupted
>> filesystems. After moving everything back with export-import to the old
>> domain where possible, and recovering from backups where needed, we set
>> off to investigate this issue.
>>
>> We are now at the point where we can reproduce this issue within a day.
>> What we have found so far:
>> 1) The corruption occurs at the very end of the replication step, most
>> probably between START and FINISH of diskReplicateFinish, before the
>> START merge step.
>> 2) In the corrupted VM, at some place where data should be, the data is
>> replaced by zeros. This can be file contents, a directory structure, or
>> anything else.
>> 3) The source gluster volume has different settings than the destination
>> (mostly because the defaults were different at creation time):
>>
>> Setting                       old (src)   new (dst)
>> cluster.op-version            30800       30800 (the same)
>> cluster.max-op-version        31202       31202 (the same)
>> cluster.metadata-self-heal    off         on
>> cluster.data-self-heal        off         on
>> cluster.entry-self-heal       off         on
>> performance.low-prio-threads  16          32
>> performance.strict-o-direct   off         on
>> network.ping-timeout          42          30
>> network.remote-dio            enable      off
>> transport.address-family      -           inet
>> performance.stat-prefetch     off         on
>> features.shard-block-size     512MB       64MB
>> cluster.shd-max-threads       1           8
>> cluster.shd-wait-qlength      1024        1
>> cluster.locking-scheme        full        granular
>> cluster.granular-entry-heal   no          enable
>>
>> 4) To test, we migrate some VMs back and forth. The corruption does not
>> occur every time. So far it has only occurred from old to new, but we
>> don't have enough data points to be sure about that.
>>
>> Does anybody have an idea what is causing the corruption? Is this the
>> best list to ask, or should I ask on a Gluster list? I am not sure if
>> this is oVirt specific or Gluster specific though.
> Do you have logs from old and new gluster volumes? Any errors in the
> new volume's fuse mount logs?

Around the time of corruption I see the message:
The message "I [MSGID: 133017] [shard.c:4941:shard_seek] 0-ZoneA_Gluster1-shard: seek called on 7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not supported]" repeated 231 times between [2019-03-26 13:14:22.297333] and [2019-03-26 13:15:42.912170]

I also see this message at other times, when I don't see the corruption occur, 
though.

-- 
Sander
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-Maintainers] Announcing Gluster release 5.5

2019-03-26 Thread Niels de Vos
On Tue, Mar 26, 2019 at 11:26:00AM -0500, Darrell Budic wrote:
> Heads up for the CentOS storage maintainers, I’ve tested 5.5 on my dev
> cluster and it behaves well. It also resolved rolling upgrade issues in a
> hyperconverged oVirt cluster for me, so I recommend moving it out of testing.

Thanks for the info! The packages were already pushed to the CentOS mirrors
yesterday. Some mirrors take a little more time to catch up, but I expect
they all have the update by now.

Niels


> 
>   -Darrell
> 
> > On Mar 21, 2019, at 6:06 AM, Shyam Ranganathan  wrote:
> > 
> > The Gluster community is pleased to announce the release of Gluster
> > 5.5 (packages available at [1]).
> > 
> > Release notes for the release can be found at [2].
> > 
> > Major changes, features and limitations addressed in this release:
> > 
> > - Release 5.4 introduced an incompatible change that prevented rolling
> > upgrades, and hence was never announced to the lists. As a result we are
> > jumping a release version and going from 5.3 to 5.5, which does not have
> > the problem.
> > 
> > Thanks,
> > Gluster community
> > 
> > [1] Packages for 5.5:
> > https://download.gluster.org/pub/gluster/glusterfs/5/5.5/
> > 
> > [2] Release notes for 5.5:
> > https://docs.gluster.org/en/latest/release-notes/5.5/
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
> 
> ___
> maintainers mailing list
> maintain...@gluster.org
> https://lists.gluster.org/mailman/listinfo/maintainers
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-26 Thread Darrell Budic
Heads up for the CentOS storage maintainers, I’ve tested 5.5 on my dev cluster
and it behaves well. It also resolved rolling upgrade issues in a
hyperconverged oVirt cluster for me, so I recommend moving it out of testing.

  -Darrell

> On Mar 21, 2019, at 6:06 AM, Shyam Ranganathan  wrote:
> 
> The Gluster community is pleased to announce the release of Gluster
> 5.5 (packages available at [1]).
> 
> Release notes for the release can be found at [2].
> 
> Major changes, features and limitations addressed in this release:
> 
> - Release 5.4 introduced an incompatible change that prevented rolling
> upgrades, and hence was never announced to the lists. As a result we are
> jumping a release version and going from 5.3 to 5.5, which does not have
> the problem.
> 
> Thanks,
> Gluster community
> 
> [1] Packages for 5.5:
> https://download.gluster.org/pub/gluster/glusterfs/5/5.5/
> 
> [2] Release notes for 5.5:
> https://docs.gluster.org/en/latest/release-notes/5.5/
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] recovery from reboot time?

2019-03-26 Thread Sankarshan Mukhopadhyay
On Tue, Mar 26, 2019 at 6:10 PM Alvin Starr  wrote:
>
> After almost a week of doing nothing, the brick failed and we were able to
> stop and restart glusterd, and then could start a manual heal.
>
> It was interesting: when the heal started, the estimated time to completion
> was about 21 days, but as it worked through the 30-some entries it got
> faster, to the point where it completed in 2 days.
>
> Now I have 2 gfids that refuse to heal.
>

Do you need help from the developers on that topic?
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Geo-replication status always on 'Created'

2019-03-26 Thread Aravinda
Please check the error message in the gsyncd.log file under
/var/log/glusterfs/geo-replication/

On Tue, 2019-03-26 at 19:44 +0530, Maurya M wrote:
> Hi Aravinda,
> I have patched my setup with your fix and re-run the setup, but this time
> I am getting a different error: it failed to commit the ssh-port on my
> other 2 nodes on the master cluster, so I manually copied the following
> into gsyncd.conf:
>
> [vars]
> ssh-port = 
>
> The status reported back is shown below. Any ideas how to troubleshoot
> this?
> 
> MASTER NODE      MASTER VOL                              MASTER BRICK                                                                                               SLAVE USER    SLAVE                                                  SLAVE NODE      STATUS             CRAWL STATUS    LAST_SYNCED
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 172.16.189.4     vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_aee3df7b0bb2451bc00a73358c5196a2/brick_116fb9427fb26f752d9ba8e45e183cb1/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    172.16.201.4    Passive            N/A             N/A
> 172.16.189.35    vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_05708751110fe60b3e7da15bdcf6d4d4/brick_266bb08f0d466d346f8c0b19569736fb/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    N/A             Faulty             N/A             N/A
> 172.16.189.66    vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_4b92a2b687e59b7311055d3809b77c06/brick_dfa44c9380cdedac708e27e2c2a443a0/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    N/A             Initializing...    N/A             N/A
> 
> 
> 
> 
> On Tue, Mar 26, 2019 at 1:40 PM Aravinda  wrote:
> > I got a chance to investigate this issue further, identified an issue
> > with the Geo-replication config set, and sent a patch to fix it.
> > 
> > BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1692666
> > Patch: https://review.gluster.org/22418
> > 
> > On Mon, 2019-03-25 at 15:37 +0530, Maurya M wrote:
> > > ran this command :  ssh -p  -i /var/lib/glusterd/geo-
> > > replication/secret.pem root@gluster volume info --
> > xml 
> > > 
> > > attaching the output.
> > > 
> > > 
> > > 
> > > On Mon, Mar 25, 2019 at 2:13 PM Aravinda 
> > wrote:
> > > > Geo-rep is running `ssh -i /var/lib/glusterd/geo-replication/secret.pem root@ gluster volume info --xml` and parsing its output.
> > > > Please try to run the command from the same node and let us know
> > > > the output.
> > > > 
> > > > 
> > > > On Mon, 2019-03-25 at 11:43 +0530, Maurya M wrote:
> > > > > Now the error is on the same line 860 : as highlighted below:
> > > > >
> > > > > [2019-03-25 06:11:52.376238] E [syncdutils(monitor):332:log_raise_exception] : FAIL:
> > > > > Traceback (most recent call last):
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main
> > > > >     func(args)
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 50, in subcmd_monitor
> > > > >     return monitor.monitor(local, remote)
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 427, in monitor
> > > > >     return Monitor().multiplex(*distribute(local, remote))
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 386, in distribute
> > > > >     svol = Volinfo(slave.volume, "localhost", prelude)
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 860, in __init__
> > > > >     vi = XET.fromstring(vix)
> > > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
> > > > >     parser.feed(text)
> > > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
> > > > >     self._raiseerror(v)
> > > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
> > > > >     raise err
> > > > > ParseError: syntax error: line 1, column 0
> > > > > 
> > > > > 
> > > > > On Mon, Mar 25, 2019 at 11:29 AM Maurya M  wrote:
> > > > > > Sorry, my bad, I had put the print line in to debug; I am using
> > > > > > gluster 4.1.7 and will remove the print line.
> > > > > >
> > > > > > On Mon, Mar 25, 2019 at 10:52 AM Aravinda <avish...@redhat.com> wrote:
> > > > > > > The print statement below looks wrong. The latest Glusterfs code
> > > > > > > doesn't have this print statement. Please let us know which version of

Re: [Gluster-users] Geo-replication status always on 'Created'

2019-03-26 Thread Maurya M
Hi Aravinda,
I have patched my setup with your fix and re-run the setup, but this time
I am getting a different error: it failed to commit the ssh-port on my
other 2 nodes on the master cluster, so I manually copied the following
into gsyncd.conf:

[vars]
ssh-port = 

The status reported back is shown below. Any ideas how to troubleshoot
this?

MASTER NODE      MASTER VOL                              MASTER BRICK                                                                                               SLAVE USER    SLAVE                                                  SLAVE NODE      STATUS             CRAWL STATUS    LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
172.16.189.4     vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_aee3df7b0bb2451bc00a73358c5196a2/brick_116fb9427fb26f752d9ba8e45e183cb1/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    172.16.201.4    Passive            N/A             N/A
172.16.189.35    vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_05708751110fe60b3e7da15bdcf6d4d4/brick_266bb08f0d466d346f8c0b19569736fb/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    N/A             Faulty             N/A             N/A
172.16.189.66    vol_75a5fd373d88ba687f591f3353fa05cf    /var/lib/heketi/mounts/vg_4b92a2b687e59b7311055d3809b77c06/brick_dfa44c9380cdedac708e27e2c2a443a0/brick    root          172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f    N/A             Initializing...    N/A             N/A




On Tue, Mar 26, 2019 at 1:40 PM Aravinda  wrote:

> I got a chance to investigate this issue further, identified an issue
> with the Geo-replication config set, and sent a patch to fix it.
>
> BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1692666
> Patch: https://review.gluster.org/22418
>
> On Mon, 2019-03-25 at 15:37 +0530, Maurya M wrote:
> > ran this command :  ssh -p  -i /var/lib/glusterd/geo-
> > replication/secret.pem root@gluster volume info --xml
> >
> > attaching the output.
> >
> >
> >
> > On Mon, Mar 25, 2019 at 2:13 PM Aravinda  wrote:
> > > Geo-rep is running `ssh -i /var/lib/glusterd/geo-
> > > replication/secret.pem
> > > root@ gluster volume info --xml` and parsing its output.
> > > Please try to run the command from the same node and let us know
> > > the
> > > output.
> > >
> > >
> > > On Mon, 2019-03-25 at 11:43 +0530, Maurya M wrote:
> > > > Now the error is on the same line 860 : as highlighted below:
> > > >
> > > > [2019-03-25 06:11:52.376238] E [syncdutils(monitor):332:log_raise_exception] : FAIL:
> > > > Traceback (most recent call last):
> > > >   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main
> > > >     func(args)
> > > >   File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 50, in subcmd_monitor
> > > >     return monitor.monitor(local, remote)
> > > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 427, in monitor
> > > >     return Monitor().multiplex(*distribute(local, remote))
> > > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 386, in distribute
> > > >     svol = Volinfo(slave.volume, "localhost", prelude)
> > > >   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 860, in __init__
> > > >     vi = XET.fromstring(vix)
> > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
> > > >     parser.feed(text)
> > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
> > > >     self._raiseerror(v)
> > > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
> > > >     raise err
> > > > ParseError: syntax error: line 1, column 0
> > > >
> > > >
> > > > On Mon, Mar 25, 2019 at 11:29 AM Maurya M  wrote:
> > > > > Sorry, my bad, I had put the print line in to debug; I am using
> > > > > gluster 4.1.7 and will remove the print line.
> > > > >
> > > > > On Mon, Mar 25, 2019 at 10:52 AM Aravinda  wrote:
> > > > > > The print statement below looks wrong. The latest Glusterfs code
> > > > > > doesn't have this print statement. Please let us know which
> > > > > > version of glusterfs you are using.
> > > > > >
> > > > > >
> > > > > > ```
> > > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 860, in __init__
> > > > > >     print "debug varible " %vix
> > > > > > ```
> > > > > >
> > > > > > As a workaround, edit that file, comment out the print line, and
> > > > > > test the geo-rep config command.
> > > > > >
> > > > > >
> > > > > > On Mon, 2019-03-25 at 09:46 +0530, Maurya M wrote:
> > > > > > > hi Aravinda,
> > > > > > >  had the session 

Re: [Gluster-users] [ovirt-users] VM disk corruption with LSM on Gluster

2019-03-26 Thread Sahina Bose
+Krutika Dhananjay and gluster ml

On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen  wrote:
>
> Hello,
>
> tl;dr We have disk corruption when doing live storage migration on oVirt
> 4.2 with gluster 3.12.15. Any idea why?
>
> We have a 3-node oVirt cluster that is both compute and gluster-storage.
> The manager runs on separate hardware. We are running out of space on
> this volume, so we added another Gluster volume that is bigger, put a
> storage domain on it, and then we migrated VMs to it with LSM. After
> some time, we noticed that (some of) the migrated VMs had corrupted
> filesystems. After moving everything back with export-import to the old
> domain where possible, and recovering from backups where needed, we set
> off to investigate this issue.
>
> We are now at the point where we can reproduce this issue within a day.
> What we have found so far:
> 1) The corruption occurs at the very end of the replication step, most
> probably between START and FINISH of diskReplicateFinish, before the
> START merge step.
> 2) In the corrupted VM, at some place where data should be, the data is
> replaced by zeros. This can be file contents, a directory structure, or
> anything else.
> 3) The source gluster volume has different settings than the destination
> (mostly because the defaults were different at creation time):
>
> Setting                       old (src)   new (dst)
> cluster.op-version            30800       30800 (the same)
> cluster.max-op-version        31202       31202 (the same)
> cluster.metadata-self-heal    off         on
> cluster.data-self-heal        off         on
> cluster.entry-self-heal       off         on
> performance.low-prio-threads  16          32
> performance.strict-o-direct   off         on
> network.ping-timeout          42          30
> network.remote-dio            enable      off
> transport.address-family      -           inet
> performance.stat-prefetch     off         on
> features.shard-block-size     512MB       64MB
> cluster.shd-max-threads       1           8
> cluster.shd-wait-qlength      1024        1
> cluster.locking-scheme        full        granular
> cluster.granular-entry-heal   no          enable
>
> 4) To test, we migrate some VMs back and forth. The corruption does not
> occur every time. So far it has only occurred from old to new, but we
> don't have enough data points to be sure about that.
>
> Does anybody have an idea what is causing the corruption? Is this the
> best list to ask, or should I ask on a Gluster list? I am not sure if
> this is oVirt specific or Gluster specific though.

Do you have logs from old and new gluster volumes? Any errors in the
new volume's fuse mount logs?

>
> Kind regards,
> Sander Hoentjen
> ___
> Users mailing list -- us...@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: 
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives: 
> https://lists.ovirt.org/archives/list/us...@ovirt.org/message/43E2QYJYDHPYTIU3IFS53WS4WL5OFXUV/
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] recovery from reboot time?

2019-03-26 Thread Alvin Starr
After almost a week of doing nothing, the brick failed and we were able
to stop and restart glusterd, and then could start a manual heal.


It was interesting: when the heal started, the estimated time to completion
was about 21 days, but as it worked through the 30-some entries it got
faster, to the point where it completed in 2 days.


Now I have 2 gfids that refuse to heal.

We have also been looking at converting these systems to RHEL and buying 
support from RH but it seems that the sales arm is not interested in 
calling people back.


On 3/20/19 1:39 AM, Amar Tumballi Suryanarayan wrote:

There are 2 things that happen after a reboot.

1. glusterd (management layer) does a sanity check of its volumes, sees if
anything changed while it was down, and tries to correct its state.
  - This is fine as long as the number of volumes or nodes is small
(where small means fewer than about 100).


2. If it is a replicate or disperse volume, then the self-heal daemon
checks whether any self-heals are pending.
  - This does an 'index' crawl to check which files actually changed
while one of the bricks/nodes was down.

  - If this list is big, it can sometimes take some time.

But 'days/weeks/months' is not an expected/observed behavior. Are there
any logs in the log file? If not, can you run 'strace -f' on the pid
that is consuming the most CPU? (a 1-minute strace sample is good enough).


-Amar
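
As a side note, one rough way to see how much work the 'index' crawl
described above still has is to count the entries in the brick's xattrop
index. The sketch below is hedged: the brick path is hypothetical, the
.glusterfs/indices/xattrop layout and the skipping of the base "xattrop-"
entry are my assumptions, and the supported way to get the same information
is `gluster volume heal <volname> info`.

```
import os

# Hypothetical brick path -- substitute your own brick.
BRICK = "/bricks/brick1"
index_dir = os.path.join(BRICK, ".glusterfs", "indices", "xattrop")

# The xattrop index is what the self-heal daemon's 'index' crawl walks, so
# the number of entries here is a rough count of files still pending heal.
# The base "xattrop-..." link file is skipped (an assumption about layout).
pending = [e for e in os.listdir(index_dir) if not e.startswith("xattrop")]
print("entries pending heal on this brick: %d" % len(pending))
```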


On Wed, Mar 20, 2019 at 2:05 AM Alvin Starr  wrote:


We have a simple replicated volume with 1 brick on each node of 17TB.

There is something like 35M files and directories on the volume.

One of the servers rebooted and is now "doing something".

It kind of looks like it's doing some kind of sanity check with the node
that did not reboot, but it's hard to say, and it looks like it may run
for hours/days/months.

Will Gluster take a long time with lots of little files to resync?


-- 
Alvin Starr                   ||   land:  (905)513-7688

Netvel Inc.                   ||   Cell:  (416)806-0133
al...@netvel.net               ||

___
Gluster-users mailing list
Gluster-users@gluster.org 
https://lists.gluster.org/mailman/listinfo/gluster-users



--
Amar Tumballi (amarts)


--
Alvin Starr   ||   land:  (905)513-7688
Netvel Inc.   ||   Cell:  (416)806-0133
al...@netvel.net  ||

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Prioritise local bricks for IO?

2019-03-26 Thread Nux!
Hello,

I'm trying to set up a distributed backup storage (no replicas), but I'd like
to prioritise the local bricks for any IO done on the volume.
This will be a backup store, so in other words, I'd like the files to be written
locally if there is space, so as to save the NICs for other traffic.

Does anyone know how this might be achievable, if at all?

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Geo-replication status always on 'Created'

2019-03-26 Thread Aravinda
I got a chance to investigate this issue further, identified an issue
with the Geo-replication config set, and sent a patch to fix it.

BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1692666
Patch: https://review.gluster.org/22418
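
The ParseError in the traceback quoted below comes from feeding the output of
the `ssh ... gluster volume info --xml` call into ElementTree, so one quick
check is to run that step by hand and look at what actually comes back. The
sketch below is only illustrative: the port, key path and slave host are
placeholders, and the XML element paths are written from memory of the CLI
output rather than taken from this thread.

```
import subprocess
from xml.etree import ElementTree as XET

# Placeholders -- substitute the real SSH port, key and slave host.
SSH_PORT = "22"
PEM = "/var/lib/glusterd/geo-replication/secret.pem"
SLAVE_HOST = "slave.example.com"

out = subprocess.check_output([
    "ssh", "-p", SSH_PORT, "-i", PEM, "root@" + SLAVE_HOST,
    "gluster", "volume", "info", "--xml",
])

if not out.strip():
    # Empty (or non-XML) output is what produces
    # "ParseError: syntax error: line 1, column 0" inside Volinfo.
    raise SystemExit("remote 'gluster volume info --xml' returned no output")

root = XET.fromstring(out)  # the same call geo-rep's Volinfo makes
for vol in root.findall("./volInfo/volumes/volume"):
    print(vol.findtext("name"), vol.findtext("statusStr"))
```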

On Mon, 2019-03-25 at 15:37 +0530, Maurya M wrote:
> ran this command :  ssh -p  -i /var/lib/glusterd/geo-
> replication/secret.pem root@gluster volume info --xml 
> 
> attaching the output.
> 
> 
> 
> On Mon, Mar 25, 2019 at 2:13 PM Aravinda  wrote:
> > Geo-rep is running `ssh -i /var/lib/glusterd/geo-
> > replication/secret.pem 
> > root@ gluster volume info --xml` and parsing its output.
> > Please try to run the command from the same node and let us know
> > the
> > output.
> > 
> > 
> > On Mon, 2019-03-25 at 11:43 +0530, Maurya M wrote:
> > > Now the error is on the same line 860 : as highlighted below:
> > >
> > > [2019-03-25 06:11:52.376238] E [syncdutils(monitor):332:log_raise_exception] : FAIL:
> > > Traceback (most recent call last):
> > >   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main
> > >     func(args)
> > >   File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 50, in subcmd_monitor
> > >     return monitor.monitor(local, remote)
> > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 427, in monitor
> > >     return Monitor().multiplex(*distribute(local, remote))
> > >   File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 386, in distribute
> > >     svol = Volinfo(slave.volume, "localhost", prelude)
> > >   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 860, in __init__
> > >     vi = XET.fromstring(vix)
> > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
> > >     parser.feed(text)
> > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
> > >     self._raiseerror(v)
> > >   File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
> > >     raise err
> > > ParseError: syntax error: line 1, column 0
> > > 
> > > 
> > > On Mon, Mar 25, 2019 at 11:29 AM Maurya M  wrote:
> > > > Sorry, my bad, I had put the print line in to debug; I am using
> > > > gluster 4.1.7 and will remove the print line.
> > > > 
> > > > On Mon, Mar 25, 2019 at 10:52 AM Aravinda  wrote:
> > > > > The print statement below looks wrong. The latest Glusterfs code
> > > > > doesn't have this print statement. Please let us know which
> > > > > version of glusterfs you are using.
> > > > > 
> > > > > 
> > > > > ```
> > > > >   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 860, in __init__
> > > > >     print "debug varible " %vix
> > > > > ```
> > > > > 
> > > > > As a workaround, edit that file, comment out the print line, and
> > > > > test the geo-rep config command.
> > > > > 
> > > > > 
> > > > > On Mon, 2019-03-25 at 09:46 +0530, Maurya M wrote:
> > > > > > hi Aravinda,
> > > > > >  had the session created using : create ssh-port  push-pem and
> > > > > > also the :
> > > > > >
> > > > > > gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf
> > > > > > 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f config ssh-port
> > > > > > 
> > > > > > 
> > > > > > hitting this message:
> > > > > > geo-replication config-set failed for
> > > > > > vol_75a5fd373d88ba687f591f3353fa05cf
> > > > > > 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f
> > > > > > geo-replication command failed
> > > > > > 
> > > > > > Below is snap of status:
> > > > > > 
> > > > > > [root@k8s-agentpool1-24779565-1 vol_75a5fd373d88ba687f591f3353fa05cf_172.16.201.35_vol_e783a730578e45ed9d51b9a80df6c33f]# gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f status
> > > > > >
> > > > > > MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE    SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
> > > > > > ------------------------------------------------------------------------------------------------------------------------
> > > > > > 172.16.189.4   vol_75a5fd373d88ba687f591f3353fa05cf