Re: [Gluster-users] VM disks corruption on 3.7.11

2016-06-13 Thread Kevin Lemonnier
> Kevin, did you solve this issue? Any updates?

Oh yeah, we discussed it on IRC and it's apparently a known bug that's
fixed in the next version. I tested a patched build and it does seem to
work, so I've been waiting for 3.7.12 since then to do some proper testing
and confirm that it's been solved. It should hopefully be out in the next
few days, last I heard.

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-06-13 Thread Gandalf Corvotempesta
2016-05-27 13:56 GMT+02:00 Kevin Lemonnier :
> Yes, I did configure it to do a daily scrub when I reinstalled last time,
> when I was wondering if maybe it was hardware. Doesn't seem like it detected
> anything.

Kevin, did you solve this issue? Any updates?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-27 Thread Lindsay Mathieson

On 27/05/2016 9:56 PM, Kevin Lemonnier wrote:

Yes, I did configure it to do a daily scrub when I reinstalled last time,
when I was wondering if maybe it was hardware. Doesn't seem like it detected
anything.


I was wondering if the scrub was interfering with things.

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-27 Thread Kevin Lemonnier
>Just a thought - do you have bitrot detection enabled? (I don't)

Yes, I did configure it to do a daily scrub when I reinstalled last time,
back when I was wondering if it might be hardware. It doesn't seem to have
detected anything.
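
For reference, the setup is roughly the following; this is a sketch from memory,
assuming the volume is simply named "gluster", and exact option names can differ
slightly between releases:

# enable the bitrot daemon on the volume and schedule a daily scrub
gluster volume bitrot gluster enable
gluster volume bitrot gluster scrub-frequency daily
# keep the scrub throttle lazy since the bricks also serve VM traffic
gluster volume bitrot gluster scrub-throttle lazy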

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-27 Thread Lindsay Mathieson

On 26/05/2016 1:58 AM, Kevin Lemonnier wrote:

There, re-created the VM from scratch, and still got the same errors.




Just a thought - do you have bitrot detection enabled? (I don't)

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Kevin Lemonnier
There, I re-created the VM from scratch and still got the same errors.
The logs are attached. I created the VM on node 50 and it worked fine; I
rebooted it and started my import again, still fine. I then powered off the
VM, started it again on node 2, rebooted it a bunch of times and got the
error as usual. I've also attached a screenshot of the VM's console, it
might help.

I can see that every time the VM powers down, GlusterFS complains about an
inode still being active; might that be the problem?

Thanks for the help!



On Wed, May 25, 2016 at 04:10:02PM +0200, Kevin Lemonnier wrote:
> Just did that, below is the output.
> Didn't seem to move after the boot, and no new lines when the I/O errors 
> appeared.
> Also, as mentionned I tried moving the disk on NFS and had the exact same 
> errors,
> so it doesn't look like it's a libgfapi problem ..
> I should probably re-create the VM, maybe the errors from this night corrupted
> the disk and I now get errors unrelated to the original issue.
> 
> Let me re-create the VM from scratch and try to reproduce the problem with
> the logs enabled, maybe it'll be more informative than this !
> 
> 
> [2016-05-25 13:56:30.851493] I [MSGID: 104045] [glfs-master.c:95:notify] 
> 0-gfapi: New graph 6e79-3635-3033-2e69-702d34362d31 (0) coming up
> [2016-05-25 13:56:30.851553] I [MSGID: 114020] [client.c:2106:notify] 
> 0-gluster-client-0: parent translators are ready, attempting connect on 
> transport
> [2016-05-25 13:56:30.852130] I [MSGID: 114020] [client.c:2106:notify] 
> 0-gluster-client-1: parent translators are ready, attempting connect on 
> transport
> [2016-05-25 13:56:30.852650] I [MSGID: 114020] [client.c:2106:notify] 
> 0-gluster-client-2: parent translators are ready, attempting connect on 
> transport
> [2016-05-25 13:56:30.852909] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
> 0-gluster-client-0: changing port to 49152 (from 0)
> [2016-05-25 13:56:30.853434] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
> 0-gluster-client-1: changing port to 49152 (from 0)
> [2016-05-25 13:56:30.853484] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
> 0-gluster-client-2: changing port to 49152 (from 0)
> [2016-05-25 13:56:30.854182] I [MSGID: 114057] 
> [client-handshake.c:1437:select_server_supported_programs] 
> 0-gluster-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2016-05-25 13:56:30.854398] I [MSGID: 114057] 
> [client-handshake.c:1437:select_server_supported_programs] 
> 0-gluster-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2016-05-25 13:56:30.854441] I [MSGID: 114057] 
> [client-handshake.c:1437:select_server_supported_programs] 
> 0-gluster-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2016-05-25 13:56:30.861931] I [MSGID: 114046] 
> [client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-2: Connected 
> to gluster-client-2, attached to remote volume '/mnt/storage/gluster'.
> [2016-05-25 13:56:30.861965] I [MSGID: 114047] 
> [client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-2: Server and 
> Client lk-version numbers are not same, reopening the fds
> [2016-05-25 13:56:30.862073] I [MSGID: 108005] [afr-common.c:4007:afr_notify] 
> 0-gluster-replicate-0: Subvolume 'gluster-client-2' came back up; going 
> online.
> [2016-05-25 13:56:30.862139] I [MSGID: 114035] 
> [client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-2: Server 
> lk version = 1
> [2016-05-25 13:56:30.865451] I [MSGID: 114046] 
> [client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-1: Connected 
> to gluster-client-1, attached to remote volume '/mnt/storage/gluster'.
> [2016-05-25 13:56:30.865485] I [MSGID: 114047] 
> [client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-1: Server and 
> Client lk-version numbers are not same, reopening the fds
> [2016-05-25 13:56:30.865757] I [MSGID: 114035] 
> [client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-1: Server 
> lk version = 1
> [2016-05-25 13:56:30.865826] I [MSGID: 114046] 
> [client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-0: Connected 
> to gluster-client-0, attached to remote volume '/mnt/storage/gluster'.
> [2016-05-25 13:56:30.865841] I [MSGID: 114047] 
> [client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-0: Server and 
> Client lk-version numbers are not same, reopening the fds
> [2016-05-25 13:56:30.888604] I [MSGID: 114035] 
> [client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-0: Server 
> lk version = 1
> [2016-05-25 13:56:30.890388] I [MSGID: 108031] 
> [afr-common.c:1900:afr_local_discovery_cbk] 0-gluster-replicate-0: selecting 
> local read_child gluster-client-2
> [2016-05-25 13:56:30.890731] I [MSGID: 104041] 
> [glfs-resolve.c:869:__glfs_active_subvol] 0-gluster: switched to graph 
> 6e79-3635-3033-2e69-702d34362d31 (0)
> 
> 
> 
> On Wed, May 25, 2016 at 02:48:27PM +0530, Krutika Dhananjay wrote:
> >Also, it seems Lindsay knows a way to get the gluster client logs when

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Kevin Lemonnier
Just did that, below is the output.
It didn't seem to move after the boot, and there were no new lines when the
I/O errors appeared. Also, as mentioned, I tried moving the disk to NFS and
had the exact same errors, so it doesn't look like a libgfapi-only problem.
I should probably re-create the VM; maybe the errors from last night
corrupted the disk and I now get errors unrelated to the original issue.

Let me re-create the VM from scratch and try to reproduce the problem with
the logs enabled, maybe it'll be more informative than this!


[2016-05-25 13:56:30.851493] I [MSGID: 104045] [glfs-master.c:95:notify] 
0-gfapi: New graph 6e79-3635-3033-2e69-702d34362d31 (0) coming up
[2016-05-25 13:56:30.851553] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-0: parent translators are ready, attempting connect on 
transport
[2016-05-25 13:56:30.852130] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-1: parent translators are ready, attempting connect on 
transport
[2016-05-25 13:56:30.852650] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-2: parent translators are ready, attempting connect on 
transport
[2016-05-25 13:56:30.852909] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
0-gluster-client-0: changing port to 49152 (from 0)
[2016-05-25 13:56:30.853434] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
0-gluster-client-1: changing port to 49152 (from 0)
[2016-05-25 13:56:30.853484] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
0-gluster-client-2: changing port to 49152 (from 0)
[2016-05-25 13:56:30.854182] I [MSGID: 114057] 
[client-handshake.c:1437:select_server_supported_programs] 0-gluster-client-0: 
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-05-25 13:56:30.854398] I [MSGID: 114057] 
[client-handshake.c:1437:select_server_supported_programs] 0-gluster-client-1: 
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-05-25 13:56:30.854441] I [MSGID: 114057] 
[client-handshake.c:1437:select_server_supported_programs] 0-gluster-client-2: 
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-05-25 13:56:30.861931] I [MSGID: 114046] 
[client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-2: Connected to 
gluster-client-2, attached to remote volume '/mnt/storage/gluster'.
[2016-05-25 13:56:30.861965] I [MSGID: 114047] 
[client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-2: Server and 
Client lk-version numbers are not same, reopening the fds
[2016-05-25 13:56:30.862073] I [MSGID: 108005] [afr-common.c:4007:afr_notify] 
0-gluster-replicate-0: Subvolume 'gluster-client-2' came back up; going online.
[2016-05-25 13:56:30.862139] I [MSGID: 114035] 
[client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-2: Server 
lk version = 1
[2016-05-25 13:56:30.865451] I [MSGID: 114046] 
[client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-1: Connected to 
gluster-client-1, attached to remote volume '/mnt/storage/gluster'.
[2016-05-25 13:56:30.865485] I [MSGID: 114047] 
[client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-1: Server and 
Client lk-version numbers are not same, reopening the fds
[2016-05-25 13:56:30.865757] I [MSGID: 114035] 
[client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-1: Server 
lk version = 1
[2016-05-25 13:56:30.865826] I [MSGID: 114046] 
[client-handshake.c:1213:client_setvolume_cbk] 0-gluster-client-0: Connected to 
gluster-client-0, attached to remote volume '/mnt/storage/gluster'.
[2016-05-25 13:56:30.865841] I [MSGID: 114047] 
[client-handshake.c:1224:client_setvolume_cbk] 0-gluster-client-0: Server and 
Client lk-version numbers are not same, reopening the fds
[2016-05-25 13:56:30.888604] I [MSGID: 114035] 
[client-handshake.c:193:client_set_lk_version_cbk] 0-gluster-client-0: Server 
lk version = 1
[2016-05-25 13:56:30.890388] I [MSGID: 108031] 
[afr-common.c:1900:afr_local_discovery_cbk] 0-gluster-replicate-0: selecting 
local read_child gluster-client-2
[2016-05-25 13:56:30.890731] I [MSGID: 104041] 
[glfs-resolve.c:869:__glfs_active_subvol] 0-gluster: switched to graph 
6e79-3635-3033-2e69-702d34362d31 (0)



On Wed, May 25, 2016 at 02:48:27PM +0530, Krutika Dhananjay wrote:
>Also, it seems Lindsay knows a way to get the gluster client logs when
>using proxmox and libgfapi.
>Would it be possible for you to get that sorted with Lindsay's help before
>recreating this issue next time
>and share the glusterfs client logs from all the nodes when you do hit the
>issue?
>It is critical for some of the debugging we do. :)
> 
>-Krutika
>On Wed, May 25, 2016 at 2:38 PM, Krutika Dhananjay 
>wrote:
> 
>  Hi Kevin,
> 
>  If you actually ran into a 'read-only filesystem' issue, then it could
>  possibly because of a bug in AFR
>  that Pranith recently fixed.
>  To confirm if that is indeed the case, could you tell me if you saw
>  the pause after a brick (single brick) was
>  down while IO was going 

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Lindsay Mathieson

On 25/05/2016 5:58 PM, Kevin Lemonnier wrote:

I use XFS, I read that was recommended. What are you using ?
Since yours seems to work, I'm not opposed to changing !


ZFS

- RAID10 (4 * WD Red 3TB)

- 8GB ram dedicated to ZFS

- SSD for log and cache (10GB and 100GB partitions respectively)

 * compression=lz4
 * atime=off
 * xattr=sa
 * sync=standard
 * acltype=posixacl
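
For anyone wanting to reproduce something similar, it boils down to roughly the
following; a sketch only, the pool name "tank" and the device names are
placeholders rather than my actual layout:

# RAID10: two mirrored pairs across the four WD Reds
zpool create tank mirror sda sdb mirror sdc sdd
# SSD partitions for SLOG (10GB) and L2ARC (100GB)
zpool add tank log sde1
zpool add tank cache sde2
# cap the ARC at ~8GB (8 * 1024^3 bytes); takes effect after a module reload/reboot
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
# dataset properties used for the brick
zfs set compression=lz4 tank
zfs set atime=off tank
zfs set xattr=sa tank
zfs set sync=standard tank
zfs set acltype=posixacl tank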


What sort of I/O load are you seeing? Mine varies between 0.6% and 5%, with
occasional spikes to 30% (updates etc.).



I have had several Windows VMs lock up on me in the past 4 weeks -
maybe it's related.




--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Krutika Dhananjay
Also, it seems Lindsay knows a way to get the gluster client logs when
using proxmox and libgfapi.
Would it be possible for you to get that sorted with Lindsay's help before
recreating this issue next time
and share the glusterfs client logs from all the nodes when you do hit the
issue?
It is critical for some of the debugging we do. :)

-Krutika


On Wed, May 25, 2016 at 2:38 PM, Krutika Dhananjay 
wrote:

> Hi Kevin,
>
>
> If you actually ran into a 'read-only filesystem' issue, then it could
> possibly because of a bug in AFR
> that Pranith recently fixed.
> To confirm if that is indeed the case, could you tell me  if you saw the
> pause after a brick (single brick) was
> down while IO was going on?
>
> -Krutika
>
> On Wed, May 25, 2016 at 1:28 PM, Kevin Lemonnier 
> wrote:
>
>> >Whats the underlying filesystem under the bricks?
>>
>> I use XFS, I read that was recommended. What are you using ?
>> Since yours seems to work, I'm not opposed to changing !
>>
>> --
>> Kevin Lemonnier
>> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Kevin Lemonnier
Hi,

Not that I know of, no. It doesn't look like the bricks have trouble
communicating, but is there a simple way to check that in GlusterFS,
some sort of brick uptime? Who knows, maybe the bricks are flickering
and I don't notice; that's entirely possible.
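The closest thing I've found so far is the volume status output; a quick sketch,
assuming the volume is named "gluster":

# shows whether each brick process is online, plus its PID and port
gluster volume status gluster
# per-brick detail (device, free space, inode counts)
gluster volume status gluster detail
# and peer connectivity as glusterd sees it
gluster peer status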

As mentioned, the problem occurs on its own. I can trigger it faster
by using the disk a lot (doing a database import), but it occurred last
night for example and I wasn't using the machine at all. I googled a bit
and found quite a lot of threads on the Proxmox forum about this, but for
older versions of GlusterFS.

I usually use qcow2; I just tried with raw and got the same problem. I also
just mounted the volume with NFS, and I'm currently moving the disk onto it
to see if the problem is libgfapi-only or if it happens with NFS too.


On Wed, May 25, 2016 at 02:38:00PM +0530, Krutika Dhananjay wrote:
>Hi Kevin,
> 
>If you actually ran into a 'read-only filesystem' issue, then it could
>possibly because of a bug in AFR
>that Pranith recently fixed.
>To confirm if that is indeed the case, could you tell me if you saw the
>pause after a brick (single brick) was
>down while IO was going on?
> 
>-Krutika
>On Wed, May 25, 2016 at 1:28 PM, Kevin Lemonnier 
>wrote:
> 
>  >  What's the underlying filesystem under the bricks?
> 
>  I use XFS, I read that was recommended. What are you using ?
>  Since yours seems to work, I'm not opposed to changing !
>  --
>  Kevin Lemonnier
>  PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>  ___
>  Gluster-users mailing list
>  Gluster-users@gluster.org
>  http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Krutika Dhananjay
Hi Kevin,


If you actually ran into a 'read-only filesystem' issue, then it could
possibly be because of a bug in AFR that Pranith recently fixed.
To confirm if that is indeed the case, could you tell me if you saw the
pause after a brick (a single brick) was down while IO was going on?

-Krutika

On Wed, May 25, 2016 at 1:28 PM, Kevin Lemonnier 
wrote:

> >Whats the underlying filesystem under the bricks?
>
> I use XFS, I read that was recommended. What are you using ?
> Since yours seems to work, I'm not opposed to changing !
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Kevin Lemonnier
>Whats the underlying filesystem under the bricks?

I use XFS, I read that was recommended. What are you using?
Since yours seems to work, I'm not opposed to changing!

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Lindsay Mathieson

On 25/05/2016 5:36 PM, Kevin Lemonnier wrote:

Nope, not solved !
Looks like directsync just delays the problem, this morning the VM had
thrown a bunch of I/O errors again. Tried writethrough and it seems to
behave exactly like cache=none, the errors appear in a few minutes.
Trying again with directsync and no errors for now, so it looks like
directsync is better than nothing, but still doesn't solve the problem.




Bummer :(


Whats the underlying filesystem under the bricks?

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-25 Thread Kevin Lemonnier
Nope, not solved!
It looks like directsync just delays the problem: this morning the VM had
thrown a bunch of I/O errors again. I tried writethrough and it seems to
behave exactly like cache=none, the errors appear within a few minutes.
I'm trying again with directsync and there are no errors for now, so it
looks like directsync is better than nothing, but it still doesn't solve
the problem.

I really can't use this in production; the VM goes read-only after a few
days because it has seen too many I/O errors. I must be missing something.

On Tue, May 24, 2016 at 12:24:44PM +0200, Kevin Lemonnier wrote:
> So the VM were configured with cache set to none, I just tried with
> cache=directsync and it seems to be fixing the issue. Still need to run
> more test, but did a couple already with that option and no I/O errors.
> 
> Never had to do this before, is it known ? Found the clue in some old mail
> from this mailing list, did I miss some doc saying you should be using
> directsync with glusterfs ?
> 
> On Tue, May 24, 2016 at 11:33:28AM +0200, Kevin Lemonnier wrote:
> > Hi,
> > 
> > Some news on this.
> > I actually don't need to trigger a heal to get corruption, so the problem
> > is not the healing. Live migrating the VM seems to trigger corruption every
> > time, and even without that just doing a database import, rebooting then
> > doing another import seems to corrupt as well.
> > 
> > To check I created local storages on each node on the same partition as the
> > gluster bricks, on XFS, and moved the VM disk on each local storage and 
> > tested
> > the same procedure one by one, no corruption. It seems to happen only on
> > glusterFS, so I'm not so sure it's hardware anymore : if it was using local 
> > storage
> > would corrupt too, right ?
> > Could I be missing some critical configuration for VM storage on my gluster 
> > volume ?
> > 
> > 
> > On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> > > Hi,
> > > 
> > > I didn't specify it but I use "localhost" to add the storage in proxmox.
> > > My thinking is that every proxmox node is also a glusterFS node, so that
> > > should work fine.
> > > 
> > > I don't want to use the "normal" way of setting a regular address in there
> > > because you can't change it afterwards in proxmox, but could that be the 
> > > source of
> > > the problem, maybe during livre migration there is write comming from
> > > two different servers at the same time ?
> > > 
> > > 
> > > 
> > > On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> > > >Hi,
> > > > 
> > > >I will try to recreate this issue tomorrow on my machines with the 
> > > > steps
> > > >that Lindsay provided in this thread. I will let you know the result 
> > > > soon
> > > >after that.
> > > > 
> > > >-Krutika
> > > > 
> > > >On Wednesday, May 18, 2016, Kevin Lemonnier  
> > > > wrote:
> > > >> Hi,
> > > >>
> > > >> Some news on this.
> > > >> Over the week end the RAID Card of the node ipvr2 died, and I 
> > > > thought
> > > >> that maybe that was the problem all along. The RAID Card was 
> > > > changed
> > > >> and yesterday I reinstalled everything.
> > > >> Same problem just now.
> > > >>
> > > >> My test is simple, using the website hosted on the VMs all the time
> > > >> I reboot ipvr50, wait for the heal to complete, migrate all the 
> > > > VMs off
> > > >> ipvr2 then reboot it, wait for the heal to complete then migrate 
> > > > all
> > > >> the VMs off ipvr3 then reboot it.
> > > >> Everytime the first database VM (which is the only one really 
> > > > using the
> > > >disk
> > > >> durign the heal) starts showing I/O errors on it's disk.
> > > >>
> > > >> Am I really the only one with that problem ?
> > > >> Maybe one of the drives is dying too, who knows, but SMART isn't 
> > > > saying
> > > >anything ..
> > > >>
> > > >>
> > > >> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > > >>> Hi,
> > > >>>
> > > >>> I had a problem some time ago with 3.7.6 and freezing during 
> > > > heals,
> > > >>> and multiple persons advised to use 3.7.11 instead. Indeed, with 
> > > > that
> > > >>> version the freez problem is fixed, it works like a dream ! You 
> > > > can
> > > >>> almost not tell that a node is down or healing, everything keeps
> > > >working
> > > >>> except for a little freez when the node just went down and I 
> > > > assume
> > > >>> hasn't timed out yet, but that's fine.
> > > >>>
> > > >>> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are
> > > >proxmox
> > > >>> VMs with qCow2 disks stored on the gluster volume.
> > > >>> Here is the config :
> > > >>>
> > > >>> Volume Name: gluster
> > > >>> Type: Replicate
> > > >>> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > > >>> Status: Started
> > > >>> Number of Bricks: 1 x 3 = 3
> > > >

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-24 Thread Nicolas Ecarnot

On 24/05/2016 12:54, Lindsay Mathieson wrote:

On 24/05/2016 8:24 PM, Kevin Lemonnier wrote:

So the VM were configured with cache set to none, I just tried with
cache=directsync and it seems to be fixing the issue. Still need to run
more test, but did a couple already with that option and no I/O errors.

Never had to do this before, is it known ? Found the clue in some old mail
from this mailing list, did I miss some doc saying you should be using
directsync with glusterfs ?


Interesting, I remember seeing some issues with cache=none on the
proxmox mailing list. I use writeback or default, which might be why I
haven't encountered these issues. I suspect you would find writethrough
works as well.


 From the proxmox wiki:


"/This mode causes qemu-kvm to interact with the disk image file or
block device with O_DIRECT semantics, so the host page cache is bypassed //
// and I/O happens directly between the qemu-kvm userspace buffers
and the  storage device. Because the actual storage device may
report //
// a write as completed when placed in its write queue only, the
guest's virtual storage adapter is informed that there is a writeback
cache, //
// so the guest would be expected to send down flush commands as
needed to manage data integrity.//
// Equivalent to direct access to your hosts' disk, performance wise./"


I'll restore a test vm and try cache=none myself.


Hi,

Is there any risk this could also apply to oVirt VMs stored on glusterFS?
I see no place I could specify this cache setting in an oVirt+gluster setup.

--
Nicolas ECARNOT
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-24 Thread Lindsay Mathieson

On 24/05/2016 8:24 PM, Kevin Lemonnier wrote:

So the VM were configured with cache set to none, I just tried with
cache=directsync and it seems to be fixing the issue. Still need to run
more test, but did a couple already with that option and no I/O errors.

Never had to do this before, is it known ? Found the clue in some old mail
from this mailing list, did I miss some doc saying you should be using
directsync with glusterfs ?


Interesting, I remember seeing some issues with cache=none on the
proxmox mailing list. I use writeback or default, which might be why I
haven't encountered these issues. I suspect you would find writethrough
works as well.



From the proxmox wiki:


"/This mode causes qemu-kvm to interact with the disk image file or 
block device with O_DIRECT semantics, so the host page cache is bypassed //
// and I/O happens directly between the qemu-kvm userspace buffers 
and the  storage device. Because the actual storage device may 
report //
// a write as completed when placed in its write queue only, the 
guest's virtual storage adapter is informed that there is a writeback 
cache, //
// so the guest would be expected to send down flush commands as 
needed to manage data integrity.//

// Equivalent to direct access to your hosts' disk, performance wise./"
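
A quick way to see what O_DIRECT semantics mean in practice is to compare a
buffered write with a direct one on the brick filesystem; a rough illustration
only, the path is just an example:

# buffered write: goes through the host page cache
dd if=/dev/zero of=/mnt/storage/ddtest bs=1M count=512
# direct write: bypasses the page cache, which is what cache=none/directsync rely on
dd if=/dev/zero of=/mnt/storage/ddtest bs=1M count=512 oflag=direct
rm /mnt/storage/ddtest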


I'll restore a test vm and try cache=none myself.

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-24 Thread Kevin Lemonnier
So the VMs were configured with cache set to none; I just tried with
cache=directsync and it seems to be fixing the issue. I still need to run
more tests, but I did a couple already with that option and saw no I/O
errors.

I never had to do this before, is it a known thing? I found the clue in
some old mail from this mailing list; did I miss some doc saying you
should be using directsync with glusterfs?
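
For the record, the only thing that changed is the cache mode on the drive. In
plain QEMU terms it amounts to something like the line below; a sketch only, the
volume and image paths are examples, and in Proxmox this is set through the
VM's disk options rather than by hand:

# libgfapi access to the image, with cache=directsync instead of cache=none
qemu-system-x86_64 -enable-kvm -m 4096 \
  -drive file=gluster://localhost/gluster/images/101/vm-101-disk-1.qcow2,if=virtio,cache=directsync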

On Tue, May 24, 2016 at 11:33:28AM +0200, Kevin Lemonnier wrote:
> Hi,
> 
> Some news on this.
> I actually don't need to trigger a heal to get corruption, so the problem
> is not the healing. Live migrating the VM seems to trigger corruption every
> time, and even without that just doing a database import, rebooting then
> doing another import seems to corrupt as well.
> 
> To check I created local storages on each node on the same partition as the
> gluster bricks, on XFS, and moved the VM disk on each local storage and tested
> the same procedure one by one, no corruption. It seems to happen only on
> glusterFS, so I'm not so sure it's hardware anymore : if it was using local 
> storage
> would corrupt too, right ?
> Could I be missing some critical configuration for VM storage on my gluster 
> volume ?
> 
> 
> On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> > Hi,
> > 
> > I didn't specify it but I use "localhost" to add the storage in proxmox.
> > My thinking is that every proxmox node is also a glusterFS node, so that
> > should work fine.
> > 
> > I don't want to use the "normal" way of setting a regular address in there
> > because you can't change it afterwards in proxmox, but could that be the 
> > source of
> > the problem, maybe during livre migration there is write comming from
> > two different servers at the same time ?
> > 
> > 
> > 
> > On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> > >Hi,
> > > 
> > >I will try to recreate this issue tomorrow on my machines with the 
> > > steps
> > >that Lindsay provided in this thread. I will let you know the result 
> > > soon
> > >after that.
> > > 
> > >-Krutika
> > > 
> > >On Wednesday, May 18, 2016, Kevin Lemonnier  
> > > wrote:
> > >> Hi,
> > >>
> > >> Some news on this.
> > >> Over the week end the RAID Card of the node ipvr2 died, and I thought
> > >> that maybe that was the problem all along. The RAID Card was changed
> > >> and yesterday I reinstalled everything.
> > >> Same problem just now.
> > >>
> > >> My test is simple, using the website hosted on the VMs all the time
> > >> I reboot ipvr50, wait for the heal to complete, migrate all the VMs 
> > > off
> > >> ipvr2 then reboot it, wait for the heal to complete then migrate all
> > >> the VMs off ipvr3 then reboot it.
> > >> Everytime the first database VM (which is the only one really using 
> > > the
> > >disk
> > >> durign the heal) starts showing I/O errors on it's disk.
> > >>
> > >> Am I really the only one with that problem ?
> > >> Maybe one of the drives is dying too, who knows, but SMART isn't 
> > > saying
> > >anything ..
> > >>
> > >>
> > >> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > >>> Hi,
> > >>>
> > >>> I had a problem some time ago with 3.7.6 and freezing during heals,
> > >>> and multiple persons advised to use 3.7.11 instead. Indeed, with 
> > > that
> > >>> version the freez problem is fixed, it works like a dream ! You can
> > >>> almost not tell that a node is down or healing, everything keeps
> > >working
> > >>> except for a little freez when the node just went down and I assume
> > >>> hasn't timed out yet, but that's fine.
> > >>>
> > >>> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are
> > >proxmox
> > >>> VMs with qCow2 disks stored on the gluster volume.
> > >>> Here is the config :
> > >>>
> > >>> Volume Name: gluster
> > >>> Type: Replicate
> > >>> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > >>> Status: Started
> > >>> Number of Bricks: 1 x 3 = 3
> > >>> Transport-type: tcp
> > >>> Bricks:
> > >>> Brick1: ipvr2.client:/mnt/storage/gluster
> > >>> Brick2: ipvr3.client:/mnt/storage/gluster
> > >>> Brick3: ipvr50.client:/mnt/storage/gluster
> > >>> Options Reconfigured:
> > >>> cluster.quorum-type: auto
> > >>> cluster.server-quorum-type: server
> > >>> network.remote-dio: enable
> > >>> cluster.eager-lock: enable
> > >>> performance.quick-read: off
> > >>> performance.read-ahead: off
> > >>> performance.io-cache: off
> > >>> performance.stat-prefetch: off
> > >>> features.shard: on
> > >>> features.shard-block-size: 64MB
> > >>> cluster.data-self-heal-algorithm: full
> > >>> performance.readdir-ahead: on
> > >>>
> > >>>
> > >>> As mentioned, I rebooted one of the nodes to test the freezing 
> > > 

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-24 Thread Kevin Lemonnier
Hi,

Some news on this.
I actually don't need to trigger a heal to get corruption, so the problem
is not the healing. Live migrating the VM seems to trigger corruption every
time, and even without that, just doing a database import, rebooting, then
doing another import seems to corrupt as well.

To check, I created local storage on each node on the same partition as the
gluster bricks, on XFS, moved the VM disk onto each local storage and tested
the same procedure one by one: no corruption. It seems to happen only on
GlusterFS, so I'm not so sure it's hardware anymore; if it were, using local
storage would corrupt too, right?
Could I be missing some critical configuration for VM storage on my gluster
volume?


On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> Hi,
> 
> I didn't specify it but I use "localhost" to add the storage in proxmox.
> My thinking is that every proxmox node is also a glusterFS node, so that
> should work fine.
> 
> I don't want to use the "normal" way of setting a regular address in there
> because you can't change it afterwards in proxmox, but could that be the 
> source of
> the problem, maybe during livre migration there is write comming from
> two different servers at the same time ?
> 
> 
> 
> On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> >Hi,
> > 
> >I will try to recreate this issue tomorrow on my machines with the steps
> >that Lindsay provided in this thread. I will let you know the result soon
> >after that.
> > 
> >-Krutika
> > 
> >On Wednesday, May 18, 2016, Kevin Lemonnier  wrote:
> >> Hi,
> >>
> >> Some news on this.
> >> Over the week end the RAID Card of the node ipvr2 died, and I thought
> >> that maybe that was the problem all along. The RAID Card was changed
> >> and yesterday I reinstalled everything.
> >> Same problem just now.
> >>
> >> My test is simple, using the website hosted on the VMs all the time
> >> I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
> >> ipvr2 then reboot it, wait for the heal to complete then migrate all
> >> the VMs off ipvr3 then reboot it.
> >> Everytime the first database VM (which is the only one really using the
> >disk
> >> durign the heal) starts showing I/O errors on it's disk.
> >>
> >> Am I really the only one with that problem ?
> >> Maybe one of the drives is dying too, who knows, but SMART isn't saying
> >anything ..
> >>
> >>
> >> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> >>> Hi,
> >>>
> >>> I had a problem some time ago with 3.7.6 and freezing during heals,
> >>> and multiple persons advised to use 3.7.11 instead. Indeed, with that
> >>> version the freez problem is fixed, it works like a dream ! You can
> >>> almost not tell that a node is down or healing, everything keeps
> >working
> >>> except for a little freez when the node just went down and I assume
> >>> hasn't timed out yet, but that's fine.
> >>>
> >>> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are
> >proxmox
> >>> VMs with qCow2 disks stored on the gluster volume.
> >>> Here is the config :
> >>>
> >>> Volume Name: gluster
> >>> Type: Replicate
> >>> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> >>> Status: Started
> >>> Number of Bricks: 1 x 3 = 3
> >>> Transport-type: tcp
> >>> Bricks:
> >>> Brick1: ipvr2.client:/mnt/storage/gluster
> >>> Brick2: ipvr3.client:/mnt/storage/gluster
> >>> Brick3: ipvr50.client:/mnt/storage/gluster
> >>> Options Reconfigured:
> >>> cluster.quorum-type: auto
> >>> cluster.server-quorum-type: server
> >>> network.remote-dio: enable
> >>> cluster.eager-lock: enable
> >>> performance.quick-read: off
> >>> performance.read-ahead: off
> >>> performance.io-cache: off
> >>> performance.stat-prefetch: off
> >>> features.shard: on
> >>> features.shard-block-size: 64MB
> >>> cluster.data-self-heal-algorithm: full
> >>> performance.readdir-ahead: on
> >>>
> >>>
> >>> As mentioned, I rebooted one of the nodes to test the freezing issue I
> >had
> >>> on previous versions and appart from the initial timeout, nothing, the
> >website
> >>> hosted on the VMs keeps working like a charm even during heal.
> >>> Since it's testing, there isn't any load on it though, and I just 
> > tried
> >to refresh
> >>> the database by importing the production one on the two MySQL VMs, and
> >both of them
> >>> started doing I/O errors. I tried shutting them down and powering them
> >on again,
> >>> but same thing, even starting full heals by hand doesn't solve the
> >problem, the disks are
> >>> corrupted. They still work, but sometimes they remount their 
> > partitions
> >read only ..
> >>>
> >  

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-23 Thread Kevin Lemonnier
Hi,

I didn't specify it, but I use "localhost" to add the storage in Proxmox.
My thinking is that every Proxmox node is also a GlusterFS node, so that
should work fine.

I don't want to use the "normal" way of setting a regular address in there,
because you can't change it afterwards in Proxmox. But could that be the
source of the problem? Maybe during live migration there are writes coming
from two different servers at the same time?
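
Concretely, the storage definition looks something like this in
/etc/pve/storage.cfg; a sketch, the storage ID is just an example:

glusterfs: gluster-vm
        server localhost
        volume gluster
        content images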



On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
>Hi,
> 
>I will try to recreate this issue tomorrow on my machines with the steps
>that Lindsay provided in this thread. I will let you know the result soon
>after that.
> 
>-Krutika
> 
>On Wednesday, May 18, 2016, Kevin Lemonnier  wrote:
>> Hi,
>>
>> Some news on this.
>> Over the week end the RAID Card of the node ipvr2 died, and I thought
>> that maybe that was the problem all along. The RAID Card was changed
>> and yesterday I reinstalled everything.
>> Same problem just now.
>>
>> My test is simple, using the website hosted on the VMs all the time
>> I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
>> ipvr2 then reboot it, wait for the heal to complete then migrate all
>> the VMs off ipvr3 then reboot it.
>> Everytime the first database VM (which is the only one really using the
>disk
>> durign the heal) starts showing I/O errors on it's disk.
>>
>> Am I really the only one with that problem ?
>> Maybe one of the drives is dying too, who knows, but SMART isn't saying
>anything ..
>>
>>
>> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
>>> Hi,
>>>
>>> I had a problem some time ago with 3.7.6 and freezing during heals,
>>> and multiple persons advised to use 3.7.11 instead. Indeed, with that
>>> version the freez problem is fixed, it works like a dream ! You can
>>> almost not tell that a node is down or healing, everything keeps
>working
>>> except for a little freez when the node just went down and I assume
>>> hasn't timed out yet, but that's fine.
>>>
>>> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are
>proxmox
>>> VMs with qCow2 disks stored on the gluster volume.
>>> Here is the config :
>>>
>>> Volume Name: gluster
>>> Type: Replicate
>>> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
>>> Status: Started
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ipvr2.client:/mnt/storage/gluster
>>> Brick2: ipvr3.client:/mnt/storage/gluster
>>> Brick3: ipvr50.client:/mnt/storage/gluster
>>> Options Reconfigured:
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> features.shard: on
>>> features.shard-block-size: 64MB
>>> cluster.data-self-heal-algorithm: full
>>> performance.readdir-ahead: on
>>>
>>>
>>> As mentioned, I rebooted one of the nodes to test the freezing issue I
>had
>>> on previous versions and appart from the initial timeout, nothing, the
>website
>>> hosted on the VMs keeps working like a charm even during heal.
>>> Since it's testing, there isn't any load on it though, and I just tried
>to refresh
>>> the database by importing the production one on the two MySQL VMs, and
>both of them
>>> started doing I/O errors. I tried shutting them down and powering them
>on again,
>>> but same thing, even starting full heals by hand doesn't solve the
>problem, the disks are
>>> corrupted. They still work, but sometimes they remount their partitions
>read only ..
>>>
>>> I believe there is a few people already using 3.7.11, no one noticed
>corruption problems ?
>>> Anyone using Proxmox ? As already mentionned in multiple other threads
>on this mailing list
>>> by other users, I also have pretty much always shards in heal info, but
>nothing "stuck" there,
>>> they always go away in a few seconds getting replaced by other shards.
>>>
>>> Thanks
>>>
>>> --
>>> Kevin Lemonnier
>>> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>>
>>
>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>>
>> --
>> Kevin Lemonnier
>> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>>

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-19 Thread David Gossage
*David Gossage*
*Carousel Checks Inc. | System Administrator*
*Office* 708.613.2284

On Thu, May 19, 2016 at 7:25 PM, Kevin Lemonnier 
wrote:

> The I/O errors are happening after, not during the heal.
> As described, I just rebooted a node, waited for the heal to finish,
> rebooted another, waited for the heal to finish then rebooted the third.
> From that point, the VM just has a lot of I/O errors showing whenever I
> use the disk a lot (importing big MySQL dumps). The VM "screen" on the
> console
> tab of proxmox just spams I/O errors from that point, which it didn't
> before rebooting
> the gluster nodes. Tried to poweroff the VM and force full heals, but I
> didn't find
> a way to fix the problem short of deleting the VM disk and restoring it
> from a backup.
>
> I have 3 other servers on 3.7.6 where that problem isn't happening, so it
> might be a 3.7.11 bug,
> but since the raid card failed recently on one of the nodes I'm not really
> sure some other
> piece of hardware isn't at fault .. Unfortunatly I don't have the hardware
> to test that.
> The only way to be sure would be to upgrade the 3.7.6 nodes to 3.7.11 and
> repeat the same tests,
> but those nodes are in production and the VM freezes during the heal last
> month already
> caused huge problems for our clients, really can't afford any other
> problems there,
> so testing on them isn't an option.
>
>
Are the 3.7.11 nodes in production?  Could they be downgraded to 3.7.6 to
see if the problem still occurs?


> To sum up, I have 3 nodes on 3.7.6 with no corruption happening but huge
> freezes during heals,
> and 3 other nodes on 3.7.11 with no freezes during heal but corruption.
> qemu-img doesn't see the
> corruption, it only shows on the VM's screen and seems mostly harmless,
> but sometimes the VM
> does switch to read-only mode saying it had too many I/O errors.
>
> Would the bitrot detection deamon detect a hardware problem ? I did enable
> it but it didn't
> detect anything, although I don't know how to force a check on it, no idea
> if it ran a scrub
> since the corruption happened.
>
>
> On Thu, May 19, 2016 at 04:04:49PM -0400, Alastair Neil wrote:
> >I am slightly confused: you say you have image file corruption, but then you
> >say the qemu-img check says there is no corruption. If what you mean is
> >that you see I/O errors during a heal, this is likely to be due to I/O
> >starvation, something that is a well-known issue.
> >There is work happening to improve this in version 3.8:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1269461
> >On 19 May 2016 at 09:58, Kevin Lemonnier 
> wrote:
> >
> >  That's a different problem then, I have corruption without removing
> or
> >  adding bricks,
> >  as mentionned. Might be two separate issue
> >
> >  On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
> >  >A  A  On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
> >  >
> >  >A  A  A  One thought - since the VM's are active while the brick is
> >  >A  A  A  removed/re-added, could it be the shards that are written
> >  while the
> >  >A  A  A  brick is added that are the reverse healing shards?
> >  >
> >  >A  A  I tested by:
> >  >
> >  >A  A  - removing brick 3
> >  >
> >  >A  A  - erasing brick 3
> >  >
> >  >A  A  - closing down all VM's
> >  >
> >  >A  A  - adding new brick 3
> >  >
> >  >A  A  - waiting until heal number reached its max and started
> >  decreasing
> >  >
> >  >A  A  A  There were no reverse heals
> >  >
> >  >A  A  - Started the VM's backup. No real issues there though one
> showed
> >  IO
> >  >A  A  errors, presumably due to shards being locked as they were
> >  healed.
> >  >
> >  >A  A  - VM's started ok, no reverse heals were noted and eventually
> >  Brick 3 was
> >  >A  A  fully healed. The VM's do not appear to be corrupted.
> >  >
> >  >A  A  So it would appear the problem is adding a brick while the
> volume
> >  is being
> >  >A  A  written to.
> >  >
> >  >A  A  Cheers,
> >  >
> >  >A  --
> >  >A  Lindsay Mathieson
> >
> >  > ___
> >  > Gluster-users mailing list
> >  > Gluster-users@gluster.org
> >  > http://www.gluster.org/mailman/listinfo/gluster-users
> >
> >  --
> >  Kevin Lemonnier
> >  PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> >  ___
> >  Gluster-users mailing list
> >  Gluster-users@gluster.org
> >  http://www.gluster.org/mailman/listinfo/gluster-users
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-19 Thread Kevin Lemonnier
The I/O errors are happening after the heal, not during it.
As described, I just rebooted a node, waited for the heal to finish,
rebooted another, waited for the heal to finish, then rebooted the third.
From that point, the VM shows a lot of I/O errors whenever I use the disk
heavily (importing big MySQL dumps). The VM "screen" on the console tab of
Proxmox just spams I/O errors from that point on, which it didn't do before
rebooting the gluster nodes. I tried powering off the VM and forcing full
heals, but I didn't find a way to fix the problem short of deleting the VM
disk and restoring it from a backup.
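
For completeness, by "waiting for the heal" and "forcing full heals" I mean
roughly the following; a sketch, assuming the volume is named "gluster":

# watch pending entries until the count drops to zero
gluster volume heal gluster info
# entries the self-heal daemon could not reconcile on its own
gluster volume heal gluster info split-brain
# kick off a full heal by hand
gluster volume heal gluster full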

I have 3 other servers on 3.7.6 where that problem isn't happening, so it
might be a 3.7.11 bug, but since the RAID card failed recently on one of the
nodes, I'm not really sure some other piece of hardware isn't at fault.
Unfortunately I don't have the hardware to test that.
The only way to be sure would be to upgrade the 3.7.6 nodes to 3.7.11 and
repeat the same tests, but those nodes are in production, and the VM freezes
during heals last month already caused huge problems for our clients. I
really can't afford any other problems there, so testing on them isn't an
option.

To sum up, I have 3 nodes on 3.7.6 with no corruption but huge freezes
during heals, and 3 other nodes on 3.7.11 with no freezes during heals but
corruption. qemu-img doesn't see the corruption; it only shows on the VM's
screen and seems mostly harmless, but sometimes the VM does switch to
read-only mode saying it had too many I/O errors.
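
The check itself is just the following, run against the image file; a sketch,
the path is an example. It only validates the qcow2 metadata, so guest-visible
I/O errors coming from a lower layer won't necessarily show up here:

qemu-img check /mnt/pve/gluster/images/101/vm-101-disk-1.qcow2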

Would the bitrot detection daemon detect a hardware problem? I did enable it,
but it didn't detect anything, although I don't know how to force a check on
it, and I have no idea if it has run a scrub since the corruption happened.
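
What I'd want is something along these lines; a sketch, and I'm not certain all
of it is available on 3.7.11, so treat it as an assumption:

# see when the scrubber last ran and whether it flagged anything
gluster volume bitrot gluster scrub status
# pause/resume the scrubber; an on-demand scrub trigger only appeared in later releases
gluster volume bitrot gluster scrub pause
gluster volume bitrot gluster scrub resume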


On Thu, May 19, 2016 at 04:04:49PM -0400, Alastair Neil wrote:
>I am slightly confused you say you have image file corruption but then you
>say the qemu-img check says there is no corruption. If what you mean is
>that you see I/O errors during a heal this is likely to be due to I/O
>starvation, something that is a well-known issue.
>There is work happening to improve this in version 3.8:
>https://bugzilla.redhat.com/show_bug.cgi?id=1269461
>On 19 May 2016 at 09:58, Kevin Lemonnier  wrote:
> 
>  That's a different problem then, I have corruption without removing or
>  adding bricks,
>  as mentionned. Might be two separate issue
> 
>  On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
>  >On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
>  >
>  >  One thought - since the VM's are active while the brick is
>  >  removed/re-added, could it be the shards that are written while the
>  >  brick is added that are the reverse healing shards?
>  >
>  >  I tested by:
>  >
>  >  - removing brick 3
>  >
>  >  - erasing brick 3
>  >
>  >  - closing down all VM's
>  >
>  >  - adding new brick 3
>  >
>  >  - waiting until heal number reached its max and started decreasing
>  >
>  >    There were no reverse heals
>  >
>  >  - Started the VM's backup. No real issues there though one showed IO
>  >    errors, presumably due to shards being locked as they were healed.
>  >
>  >  - VM's started ok, no reverse heals were noted and eventually Brick 3 was
>  >    fully healed. The VM's do not appear to be corrupted.
>  >
>  >  So it would appear the problem is adding a brick while the volume is
>  >  being written to.
>  >
>  >  Cheers,
>  >
>  >  --
>  >  Lindsay Mathieson
> 
>  > ___
>  > Gluster-users mailing list
>  > Gluster-users@gluster.org
>  > http://www.gluster.org/mailman/listinfo/gluster-users
> 
>  --
>  Kevin Lemonnier
>  PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>  ___
>  Gluster-users mailing list
>  Gluster-users@gluster.org
>  http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-19 Thread Alastair Neil
I am slightly confused: you say you have image file corruption, but then you
say that qemu-img check reports no corruption.  If what you mean is
that you see I/O errors during a heal, this is likely to be due to I/O
starvation, something that is a well-known issue.

There is work happening to improve this in version 3.8:

https://bugzilla.redhat.com/show_bug.cgi?id=1269461



On 19 May 2016 at 09:58, Kevin Lemonnier  wrote:

> That's a different problem then, I have corruption without removing or
> adding bricks,
> as mentionned. Might be two separate issue
>
>
> On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
> >On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
> >
> >  One thought - since the VM's are active while the brick is
> >  removed/re-added, could it be the shards that are written while the
> >  brick is added that are the reverse healing shards?
> >
> >I tested by:
> >
> >- removing brick 3
> >
> >- erasing brick 3
> >
> >- closing down all VM's
> >
> >- adding new brick 3
> >
> >- waiting until heal number reached its max and started decreasing
> >
> >  There were no reverse heals
> >
> >- Started the VM's backup. No real issues there though one showed IO
> >errors, presumably due to shards being locked as they were healed.
> >
> >- VM's started ok, no reverse heals were noted and eventually Brick 3
> was
> >fully healed. The VM's do not appear to be corrupted.
> >
> >So it would appear the problem is adding a brick while the volume is
> being
> >written to.
> >
> >Cheers,
> >
> >  --
> >  Lindsay Mathieson
>
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
>
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-19 Thread Lindsay Mathieson

On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
One thought - since the VM's are active while the brick is 
removed/re-added, could it be the shards that are written while the 
brick is added that are the reverse healing shards?


I tested by:

- removing brick 3

- erasing brick 3

- closing down all VM's

- adding new brick 3

- waiting until heal number reached its max and started decreasing

  There were no reverse heals

- Started the VMs back up. No real issues there, though one showed I/O
errors, presumably due to shards being locked as they were healed.


- VMs started OK, no reverse heals were noted and eventually brick 3
was fully healed. The VMs do not appear to be corrupted.



So it would appear the problem is adding a brick while the volume is
being written to.
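
For clarity, the brick cycling above was done along these lines; a rough sketch,
the hostname and brick path are placeholders for a replica 3 volume named
"gluster":

# drop brick 3 from the replica set, then wipe it
gluster volume remove-brick gluster replica 2 node3.example:/tank/vmdata/brick3 force
rm -rf /tank/vmdata/brick3
mkdir -p /tank/vmdata/brick3
# re-add it and let self-heal repopulate it
gluster volume add-brick gluster replica 3 node3.example:/tank/vmdata/brick3
gluster volume heal gluster full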



Cheers,

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-18 Thread Lindsay Mathieson

On 18/05/2016 11:41 PM, Krutika Dhananjay wrote:
I will try to recreate this issue tomorrow on my machines with the 
steps that Lindsay provided in this thread. I will let you know the 
result soon after that.


Thanks Krutika, I've been trying to get the shard stats you wanted, but
by the time the heal info completed, the shards in question had been
healed... The node in question is the last node on the list :)



I'll swap them around and try tomorrow.


One thought - since the VM's are active while the brick is 
removed/re-added, could it be the shards that are written while the 
brick is added that are the reverse healing shards?



--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-18 Thread Kevin Lemonnier
Some additional details if it helps: there is no cache on the disk, it's
virtio with iothread=1. The file is in qcow format, and qemu-img check says
it's not corrupted, but when the VM is running I get I/O errors.
As you can see in the config, performance.stat-prefetch is off, but being
on a Debian system I don't have the virt group, so I just tried to replicate
its settings by hand. Maybe I forgot something.
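
In practice, "replicating the settings by hand" amounts to the following; a
sketch matching the options listed in the volume info above, assuming the volume
is named "gluster" (on installs that ship the group file, 'gluster volume set
gluster group virt' should do the same in one step):

gluster volume set gluster cluster.quorum-type auto
gluster volume set gluster cluster.server-quorum-type server
gluster volume set gluster network.remote-dio enable
gluster volume set gluster cluster.eager-lock enable
gluster volume set gluster performance.quick-read off
gluster volume set gluster performance.read-ahead off
gluster volume set gluster performance.io-cache off
gluster volume set gluster performance.stat-prefetch off
gluster volume set gluster features.shard on
gluster volume set gluster features.shard-block-size 64MB
gluster volume set gluster cluster.data-self-heal-algorithm full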

Thanks!

On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> Hi,
> 
> I will try to recreate this issue tomorrow on my machines with the steps
> that Lindsay provided in this thread. I will let you know the result soon
> after that.
> 
> -Krutika
> 
> On Wednesday, May 18, 2016, Kevin Lemonnier  wrote:
> > Hi,
> >
> > Some news on this.
> > Over the week end the RAID Card of the node ipvr2 died, and I thought
> > that maybe that was the problem all along. The RAID Card was changed
> > and yesterday I reinstalled everything.
> > Same problem just now.
> >
> > My test is simple, using the website hosted on the VMs all the time
> > I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
> > ipvr2 then reboot it, wait for the heal to complete then migrate all
> > the VMs off ipvr3 then reboot it.
> > Everytime the first database VM (which is the only one really using the
> disk
> > durign the heal) starts showing I/O errors on it's disk.
> >
> > Am I really the only one with that problem ?
> > Maybe one of the drives is dying too, who knows, but SMART isn't saying
> anything ..
> >
> >
> > On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> >> Hi,
> >>
> >> I had a problem some time ago with 3.7.6 and freezing during heals,
> >> and multiple persons advised to use 3.7.11 instead. Indeed, with that
> >> version the freez problem is fixed, it works like a dream ! You can
> >> almost not tell that a node is down or healing, everything keeps working
> >> except for a little freez when the node just went down and I assume
> >> hasn't timed out yet, but that's fine.
> >>
> >> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are proxmox
> >> VMs with qCow2 disks stored on the gluster volume.
> >> Here is the config :
> >>
> >> Volume Name: gluster
> >> Type: Replicate
> >> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> >> Status: Started
> >> Number of Bricks: 1 x 3 = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: ipvr2.client:/mnt/storage/gluster
> >> Brick2: ipvr3.client:/mnt/storage/gluster
> >> Brick3: ipvr50.client:/mnt/storage/gluster
> >> Options Reconfigured:
> >> cluster.quorum-type: auto
> >> cluster.server-quorum-type: server
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.quick-read: off
> >> performance.read-ahead: off
> >> performance.io-cache: off
> >> performance.stat-prefetch: off
> >> features.shard: on
> >> features.shard-block-size: 64MB
> >> cluster.data-self-heal-algorithm: full
> >> performance.readdir-ahead: on
> >>
> >>
> >> As mentioned, I rebooted one of the nodes to test the freezing issue I had
> >> on previous versions and apart from the initial timeout, nothing, the website
> >> hosted on the VMs keeps working like a charm even during heal.
> >> Since it's testing, there isn't any load on it though, and I just tried to
> >> refresh the database by importing the production one on the two MySQL VMs,
> >> and both of them started doing I/O errors. I tried shutting them down and
> >> powering them on again, but same thing, even starting full heals by hand
> >> doesn't solve the problem, the disks are corrupted. They still work, but
> >> sometimes they remount their partitions read only ..
> >>
> >> I believe there are a few people already using 3.7.11, has no one noticed
> >> corruption problems ?
> >> Anyone using Proxmox ? As already mentioned in multiple other threads on
> >> this mailing list by other users, I also have pretty much always shards in
> >> heal info, but nothing "stuck" there, they always go away in a few seconds
> >> getting replaced by other shards.
> >>
> >> Thanks
> >>
> >> --
> >> Kevin Lemonnier
> >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> >
> >
> >
> >> ___
> >> Gluster-users mailing list
> >> Gluster-users@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-users
> >
> >
> > --
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> >

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-18 Thread Krutika Dhananjay
Hi,

I will try to recreate this issue tomorrow on my machines with the steps
that Lindsay provided in this thread. I will let you know the result soon
after that.

-Krutika

On Wednesday, May 18, 2016, Kevin Lemonnier  wrote:
> Hi,
>
> Some news on this.
> Over the week end the RAID Card of the node ipvr2 died, and I thought
> that maybe that was the problem all along. The RAID Card was changed
> and yesterday I reinstalled everything.
> Same problem just now.
>
> My test is simple, using the website hosted on the VMs all the time
> I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
> ipvr2 then reboot it, wait for the heal to complete then migrate all
> the VMs off ipvr3 then reboot it.
> Every time the first database VM (which is the only one really using the
> disk during the heal) starts showing I/O errors on its disk.
>
> Am I really the only one with that problem ?
> Maybe one of the drives is dying too, who knows, but SMART isn't saying
> anything ..
>
>
> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
>> Hi,
>>
>> I had a problem some time ago with 3.7.6 and freezing during heals,
>> and multiple persons advised to use 3.7.11 instead. Indeed, with that
>> version the freeze problem is fixed, it works like a dream ! You can
>> almost not tell that a node is down or healing, everything keeps working
>> except for a little freeze when the node just went down and I assume
>> hasn't timed out yet, but that's fine.
>>
>> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are proxmox
>> VMs with qCow2 disks stored on the gluster volume.
>> Here is the config :
>>
>> Volume Name: gluster
>> Type: Replicate
>> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
>> Status: Started
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: ipvr2.client:/mnt/storage/gluster
>> Brick2: ipvr3.client:/mnt/storage/gluster
>> Brick3: ipvr50.client:/mnt/storage/gluster
>> Options Reconfigured:
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: off
>> features.shard: on
>> features.shard-block-size: 64MB
>> cluster.data-self-heal-algorithm: full
>> performance.readdir-ahead: on
>>
>>
>> As mentioned, I rebooted one of the nodes to test the freezing issue I had
>> on previous versions and apart from the initial timeout, nothing, the website
>> hosted on the VMs keeps working like a charm even during heal.
>> Since it's testing, there isn't any load on it though, and I just tried to
>> refresh the database by importing the production one on the two MySQL VMs,
>> and both of them started doing I/O errors. I tried shutting them down and
>> powering them on again, but same thing, even starting full heals by hand
>> doesn't solve the problem, the disks are corrupted. They still work, but
>> sometimes they remount their partitions read only ..
>>
>> I believe there are a few people already using 3.7.11, has no one noticed
>> corruption problems ?
>> Anyone using Proxmox ? As already mentioned in multiple other threads on
>> this mailing list by other users, I also have pretty much always shards in
>> heal info, but nothing "stuck" there, they always go away in a few seconds
>> getting replaced by other shards.
>>
>> Thanks
>>
>> --
>> Kevin Lemonnier
>> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
>
>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-18 Thread Kevin Lemonnier
Hi,

Some news on this.
Over the weekend the RAID card of the node ipvr2 died, and I thought
that maybe that was the problem all along. The RAID card was changed
and yesterday I reinstalled everything.
Same problem just now.

My test is simple: while using the website hosted on the VMs the whole time,
I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
ipvr2 then reboot it, wait for the heal to complete, then migrate all
the VMs off ipvr3 and reboot it.
Every time, the first database VM (which is the only one really using the disk
during the heal) starts showing I/O errors on its disk.
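
(By "wait for the heal to complete" I mean waiting until heal info reports no
pending entries on any brick, roughly like this - a rough sketch, volume name
"gluster":)

    # block until no brick reports pending heal entries
    while gluster volume heal gluster info | grep -q '^Number of entries: [1-9]'; do
        sleep 10
    done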

Am I really the only one with that problem ?
Maybe one of the drives is dying too, who knows, but SMART isn't saying 
anything ..


On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> Hi,
> 
> I had a problem some time ago with 3.7.6 and freezing during heals,
> and multiple persons advised to use 3.7.11 instead. Indeed, with that
> version the freeze problem is fixed, it works like a dream ! You can
> almost not tell that a node is down or healing, everything keeps working
> except for a little freeze when the node just went down and I assume
> hasn't timed out yet, but that's fine.
> 
> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are proxmox
> VMs with qCow2 disks stored on the gluster volume.
> Here is the config :
> 
> Volume Name: gluster
> Type: Replicate
> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ipvr2.client:/mnt/storage/gluster
> Brick2: ipvr3.client:/mnt/storage/gluster
> Brick3: ipvr50.client:/mnt/storage/gluster
> Options Reconfigured:
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> features.shard: on
> features.shard-block-size: 64MB
> cluster.data-self-heal-algorithm: full
> performance.readdir-ahead: on
> 
> 
> As mentioned, I rebooted one of the nodes to test the freezing issue I had
> on previous versions and apart from the initial timeout, nothing, the website
> hosted on the VMs keeps working like a charm even during heal.
> Since it's testing, there isn't any load on it though, and I just tried to 
> refresh
> the database by importing the production one on the two MySQL VMs, and both 
> of them
> started doing I/O errors. I tried shutting them down and powering them on 
> again,
> but same thing, even starting full heals by hand doesn't solve the problem, 
> the disks are
> corrupted. They still work, but sometimes they remount their partitions read 
> only ..
> 
> I believe there are a few people already using 3.7.11, has no one noticed
> corruption problems ?
> Anyone using Proxmox ? As already mentioned in multiple other threads on
> this mailing list
> by other users, I also have pretty much always shards in heal info, but 
> nothing "stuck" there,
> they always go away in a few seconds getting replaced by other shards.
> 
> Thanks
> 
> -- 
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111



> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users


-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-12 Thread Kevin Lemonnier
As discussed, the missing ipvr50 log file.

On Thu, May 12, 2016 at 04:24:14PM +0200, Kevin Lemonnier wrote:
> As requested on IRC, here are the logs on the 3 nodes.
> 
> On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > Hi,
> > 
> > I had a problem some time ago with 3.7.6 and freezing during heals,
> > and multiple persons advised to use 3.7.11 instead. Indeed, with that
> > version the freeze problem is fixed, it works like a dream ! You can
> > almost not tell that a node is down or healing, everything keeps working
> > except for a little freeze when the node just went down and I assume
> > hasn't timed out yet, but that's fine.
> > 
> > Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are proxmox
> > VMs with qCow2 disks stored on the gluster volume.
> > Here is the config :
> > 
> > Volume Name: gluster
> > Type: Replicate
> > Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > Status: Started
> > Number of Bricks: 1 x 3 = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: ipvr2.client:/mnt/storage/gluster
> > Brick2: ipvr3.client:/mnt/storage/gluster
> > Brick3: ipvr50.client:/mnt/storage/gluster
> > Options Reconfigured:
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > network.remote-dio: enable
> > cluster.eager-lock: enable
> > performance.quick-read: off
> > performance.read-ahead: off
> > performance.io-cache: off
> > performance.stat-prefetch: off
> > features.shard: on
> > features.shard-block-size: 64MB
> > cluster.data-self-heal-algorithm: full
> > performance.readdir-ahead: on
> > 
> > 
> > As mentioned, I rebooted one of the nodes to test the freezing issue I had
> > on previous versions and apart from the initial timeout, nothing, the
> > website
> > hosted on the VMs keeps working like a charm even during heal.
> > Since it's testing, there isn't any load on it though, and I just tried to 
> > refresh
> > the database by importing the production one on the two MySQL VMs, and both 
> > of them
> > started doing I/O errors. I tried shutting them down and powering them on 
> > again,
> > but same thing, even starting full heals by hand doesn't solve the problem, 
> > the disks are
> > corrupted. They still work, but sometimes they remount their partitions 
> > read only ..
> > 
> > I believe there are a few people already using 3.7.11, has no one noticed
> > corruption problems ?
> > Anyone using Proxmox ? As already mentioned in multiple other threads on
> > this mailing list
> > by other users, I also have pretty much always shards in heal info, but 
> > nothing "stuck" there,
> > they always go away in a few seconds getting replaced by other shards.
> > 
> > Thanks
> > 
> > -- 
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> 
> 
> 
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
> 
> 
> -- 
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111





> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users


-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
[2016-05-09 09:05:35.262203] I [MSGID: 100030] [glusterfsd.c:2332:main] 
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.11 
(args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p 
/var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/gluster/42ecf8056c8db7918ecbc3de0575911e.socket --xlator-option 
*replicate*.node-uuid=616a7c35-4483-4ceb-92a5-1ca4a1055589)
[2016-05-09 09:05:35.268593] I [MSGID: 101190] 
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with 
index 1
[2016-05-09 09:05:35.306807] I [graph.c:269:gf_add_cmdline_options] 
0-gluster-replicate-0: adding option 'node-uuid' for volume 
'gluster-replicate-0' with value '616a7c35-4483-4ceb-92a5-1ca4a1055589'
[2016-05-09 09:05:35.315326] I [MSGID: 101190] 
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with 
index 2
[2016-05-09 09:05:35.316842] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-0: parent translators are ready, attempting connect on 
transport
[2016-05-09 09:05:35.317417] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-1: parent translators are ready, attempting connect on 
transport
[2016-05-09 09:05:35.317901] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 
0-gluster-client-0: changing port to 49152 (from 0)
[2016-05-09 09:05:35.317995] I [MSGID: 114020] [client.c:2106:notify] 
0-gluster-client-2: parent translators are ready, attempting connect on 
transport
Final graph:
+--+
  1: volume gluster-client-0
  2: type protocol/client
  3: option ping-timeout 42
  4: option remote-host ipvr2.client
  5: option 

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-12 Thread Lindsay Mathieson

On 13/05/2016 12:03 AM, Kevin Lemonnier wrote:

I just tried to refresh
the database by importing the production one on the two MySQL VMs, and both
of them started doing I/O errors.




Sorry, I don't quite understand what you did - you migrated 1 or 2 VMs 
onto the test gluster volume?


--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] VM disks corruption on 3.7.11

2016-05-12 Thread Kevin Lemonnier
As requested on IRC, here are the logs on the 3 nodes.

On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> Hi,
> 
> I had a problem some time ago with 3.7.6 and freezing during heals,
> and multiple persons advised to use 3.7.11 instead. Indeed, with that
> version the freeze problem is fixed, it works like a dream ! You can
> almost not tell that a node is down or healing, everything keeps working
> except for a little freeze when the node just went down and I assume
> hasn't timed out yet, but that's fine.
> 
> Now I have a 3.7.11 volume on 3 nodes for testing, and the VM are proxmox
> VMs with qCow2 disks stored on the gluster volume.
> Here is the config :
> 
> Volume Name: gluster
> Type: Replicate
> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ipvr2.client:/mnt/storage/gluster
> Brick2: ipvr3.client:/mnt/storage/gluster
> Brick3: ipvr50.client:/mnt/storage/gluster
> Options Reconfigured:
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> features.shard: on
> features.shard-block-size: 64MB
> cluster.data-self-heal-algorithm: full
> performance.readdir-ahead: on
> 
> 
> As mentioned, I rebooted one of the nodes to test the freezing issue I had
> on previous versions and apart from the initial timeout, nothing, the website
> hosted on the VMs keeps working like a charm even during heal.
> Since it's testing, there isn't any load on it though, and I just tried to 
> refresh
> the database by importing the production one on the two MySQL VMs, and both 
> of them
> started doing I/O errors. I tried shutting them down and powering them on 
> again,
> but same thing, even starting full heals by hand doesn't solve the problem, 
> the disks are
> corrupted. They still work, but sometimes they remount their partitions read 
> only ..
> 
> I believe there are a few people already using 3.7.11, has no one noticed
> corruption problems ?
> Anyone using Proxmox ? As already mentioned in multiple other threads on
> this mailing list
> by other users, I also have pretty much always shards in heal info, but 
> nothing "stuck" there,
> they always go away in a few seconds getting replaced by other shards.
> 
> Thanks
> 
> -- 
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111



> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users


-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


glustershd_logs.tgz
Description: GNU Unix tar archive


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] VM disks corruption on 3.7.11

2016-05-12 Thread Kevin Lemonnier
Hi,

I had a problem some time ago with 3.7.6 and freezing during heals,
and multiple persons advised to use 3.7.11 instead. Indeed, with that
version the freeze problem is fixed, it works like a dream ! You can
almost not tell that a node is down or healing, everything keeps working
except for a little freeze when the node just went down and I assume
hasn't timed out yet, but that's fine.

Now I have a 3.7.11 volume on 3 nodes for testing, and the VMs are Proxmox
VMs with qcow2 disks stored on the gluster volume.
Here is the config :

Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client:/mnt/storage/gluster
Brick2: ipvr3.client:/mnt/storage/gluster
Brick3: ipvr50.client:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on
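
(For anyone wanting to reproduce the setup: it's a plain replica 3 volume
created along these lines - the exact commands may have differed slightly -
with the options listed above then set one by one:)

    gluster volume create gluster replica 3 \
        ipvr2.client:/mnt/storage/gluster \
        ipvr3.client:/mnt/storage/gluster \
        ipvr50.client:/mnt/storage/gluster
    gluster volume start gluster
    # then, for each option in the list above:
    gluster volume set gluster features.shard on
    gluster volume set gluster features.shard-block-size 64MB
    # ... and so on for the other options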


As mentioned, I rebooted one of the nodes to test the freezing issue I had
on previous versions and apart from the initial timeout, nothing, the website
hosted on the VMs keeps working like a charm even during heal.
Since it's testing, there isn't any load on it though, and I just tried to
refresh the database by importing the production one on the two MySQL VMs,
and both of them started doing I/O errors. I tried shutting them down and
powering them on again, but same thing, even starting full heals by hand
doesn't solve the problem, the disks are corrupted. They still work, but
sometimes they remount their partitions read only ..
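
(By "starting full heals by hand" I mean the commands below; the read-only
remounts show up in the guests' dmesg - a rough sketch, the mount point is
just an example:)

    # on one of the gluster nodes
    gluster volume heal gluster full
    gluster volume heal gluster info

    # inside an affected guest
    dmesg | grep -iE 'i/o error|read-only'
    mount -o remount,rw /var/lib/mysql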

I believe there are a few people already using 3.7.11, has no one noticed
corruption problems ?
Anyone using Proxmox ? As already mentioned in multiple other threads on this
mailing list by other users, I also have pretty much always shards in heal
info, but nothing "stuck" there, they always go away in a few seconds getting
replaced by other shards.

Thanks

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users