[Gluster-users] peer probe failures

2017-04-03 Thread Kenneth Talley
Hey all,

I've got a strange problem going on here. I've installed glusterfs-server
on Ubuntu 16.04:
glusterfs-client/xenial,now 3.7.6-1ubuntu1 amd64 [installed,automatic]
glusterfs-common/xenial,now 3.7.6-1ubuntu1 amd64 [installed,automatic]
glusterfs-server/xenial,now 3.7.6-1ubuntu1 amd64 [installed]

I can successfully probe another peer at this point. Then, after installing
Kubernetes via Kargo, peer probing begins failing with a timeout. I've
tried stopping all Kubernetes-related services and flushing all iptables
rules, but I don't see any packets leaving any interface when attempting
to peer probe.
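(For reference, a sketch of that check - glusterd listens on TCP port 24007;
the peer name here is an assumption:

  # watch for outbound probe traffic in one shell:
  tcpdump -i any -n port 24007
  # ...and probe from another:
  gluster peer probe peer-hostname

Note that "iptables -F" only flushes the filter table; Kubernetes-managed
rules often live in the nat table as well, so check
"iptables -t nat -L -n -v" too.)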

From cli.log:
[2017-04-03 22:20:24.704900] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2017-04-03 22:20:24.704973] T [cli.c:273:cli_rpc_notify] 0-glusterfs: got
RPC_CLNT_CONNECT
[2017-04-03 22:20:24.705001] T [cli-quotad-client.c:94:cli_quotad_notify]
0-glusterfs: got RPC_CLNT_CONNECT
[2017-04-03 22:20:24.705014] I [socket.c:2355:socket_event_handler]
0-transport: disconnecting now
[2017-04-03 22:20:24.705204] T [rpc-clnt.c:1404:rpc_clnt_record]
0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2017-04-03 22:20:24.705256] T
[rpc-clnt.c:1261:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen
156, payload: 92, rpc hdr: 64
[2017-04-03 22:20:24.705662] T [socket.c:2879:socket_connect] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f697af5] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f012f6945b3] ) 0-glusterfs: connect () called on transport already connected
[2017-04-03 22:20:24.705680] D [rpc-clnt-ping.c:98:rpc_clnt_remove_ping_timer_locked] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f697af5] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f012f6945b3] ))))) 0-: /var/run/gluster/quotad.socket: ping timer event already removed
[2017-04-03 22:20:24.705710] T [cli-quotad-client.c:100:cli_quotad_notify]
0-glusterfs: got RPC_CLNT_DISCONNECT
[2017-04-03 22:20:24.705718] T [rpc-clnt.c:1598:rpc_clnt_submit]
0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers: 2,
Proc: 1) to rpc-transport (glusterfs)
[2017-04-03 22:20:24.705739] D [rpc-clnt-ping.c:281:rpc_clnt_start_ping]
0-glusterfs: ping timeout is 0, returning
[2017-04-03 22:20:24.705723] D [MSGID: 0]
[event-epoll.c:591:event_dispatch_epoll_handler] 0-epoll: generation bumped
on idx=1 from gen=1 to slot->gen=2, fd=7, slot->fd=7
[2017-04-03 22:20:27.614881] T [rpc-clnt.c:418:rpc_clnt_reconnect]
0-glusterfs: attempting reconnect
[2017-04-03 22:20:27.615151] T [socket.c:2879:socket_connect] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/rpc-transport/socket.so(+0x6c1b)[0x7f012a697c1b] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f012f695999] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfc)[0x7f012fd3d70c] (--> /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f012f0b86ba] ) 0-glusterfs: connect () called on transport already connected

It then repeats the following:
[2017-04-03 22:20:27.615177] T [rpc-clnt.c:418:rpc_clnt_reconnect]
0-glusterfs: attempting reconnect
[2017-04-03 22:20:27.615188] T [socket.c:2887:socket_connect] 0-glusterfs:
connecting 0x25d3550, state=0 gen=0 sock=-1
[2017-04-03 22:20:27.615200] T
[name.c:295:af_unix_client_get_remote_sockaddr] 0-glusterfs: using
connect-path /var/run/gluster/quotad.socket
[2017-04-03 22:20:27.615218] T [name.c:111:af_unix_client_bind]
0-glusterfs: bind-path not specified for unix socket, letting connect to
assign default value
[2017-04-03 22:20:27.615329] T [cli-quotad-client.c:94:cli_quotad_notify]
0-glusterfs: got RPC_CLNT_CONNECT
[2017-04-03 22:20:27.615355] I [socket.c:2355:socket_event_handler]
0-transport: disconnecting now
[2017-04-03 22:20:27.615567] D [rpc-clnt-ping.c:98:rpc_clnt_remove_ping_timer_locked] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f697af5] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (-->

[Gluster-users] Special Config or not Possible (5 bricks)

2017-04-03 Thread Holger Rojahn

Hi,

I have five disks (three 3 TB, two 3 TB).
Currently I run RAID 5 on the three 3 TB disks and RAID 1 on the two 3 TB
disks (don't discuss RAID 5 ... I know it :)
The plan was to use Gluster with sharding enabled, replica 3, on all 5 disks,
so that parts of each file are on a minimum of 3 disks (locally making 5
bricks ...)


I have tried it with loop files (5 of them), but this configuration seems
not to work: I can add 4 devices but not the 5th. (In my test the replica
count was only 2 ... the final plan is replica 3 ...)
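(For reference, a sketch of a loop-file brick setup of this kind - sizes and
paths are assumptions:

  for i in 1 2 3 4 5; do
    truncate -s 10G /srv/loop$i.img        # sparse backing file
    mkfs.xfs -f /srv/loop$i.img            # filesystem for the brick
    mkdir -p /gltest/br$i
    mount -o loop /srv/loop$i.img /gltest/br$i
  done
)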



Config:
root@ICE-NAS:~# gluster volume info vol1

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: 16f67225-4949-4510-85c2-0486aa089ec3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: ice-nas:/gltest/br1/ds
Brick2: ice-nas:/gltest/br2/ds
Brick3: ice-nas:/gltest/br3/ds
Brick4: ice-nas:/gltest/br4/ds
Options Reconfigured:
features.shard-block-size: 128MB
features.shard: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Command:
root@ICE-NAS:~# gluster volume add-brick vol1 icenas:/gltest/br5/ds
volume add-brick: failed: Incorrect number of bricks supplied 1 with count 2

If I understand the manual correctly, I can also add a single brick with
sharding enabled!?

Where is my mistake ...
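(For what it's worth, the "count 2" in the add-brick error above is the
volume's replica count: a distributed-replicate volume only accepts new
bricks in multiples of it, sharding or not. A sketch, assuming a sixth brick
directory existed:

  gluster volume add-brick vol1 ice-nas:/gltest/br5/ds ice-nas:/gltest/br6/ds
)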

Greets from Germany
Holger

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Working and up to date guide for ganesha ? nfs-ganesha gluster-ganesha

2017-04-03 Thread Travis Eddy
Hello,
I've tried all the guides I can find. There is a lot of discrepancy about
the ganesha.conf file and how it interacts with Gluster. None of the
examples I found worked, and none of them have been updated in the last
year or so, even Red Hat's.

Anyone have a link to a working guide for Ganesha?
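(For reference, the general shape of a ganesha.conf export block for a
Gluster volume in the nfs-ganesha 2.x series - the volume name and paths are
assumptions, not a tested configuration:

  EXPORT {
      Export_Id = 1;
      Path = "/vol1";
      Pseudo = "/vol1";
      Access_Type = RW;
      Squash = No_root_squash;
      SecType = "sys";
      FSAL {
          Name = GLUSTER;
          Hostname = "localhost";
          Volume = "vol1";
      }
  }
)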


Also, I'm still looking for help with Gluster hosting VM images for
XenServer. The 6 MB/s best we've seen is sad. Plain NFS with async gives us
110 MB/s.

Thanks

Travis Eddy
Smartware IT
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Gluster Performance in an Ovirt Scenario.

2017-04-03 Thread Darrell Budic
You didn't list your mount type for test 1, but it sounds like you're
NFS-mounting your storage. Is this a "standard" OS-level NFS server, or a
Ganesha-based NFS server?

If you're using "normal" NFS, your nodes write to one of your gluster servers
over the NFS mount, and that gluster server then writes it out to all the
other servers as needed before acknowledging the write as complete, limiting
your total throughput. The same is true for the read case: the server you're
talking to marshals the response from all the servers before sending it along
to the client.

If you use Ganesha, it may be able to read/write directly from/to all your 
gluster servers, which should improve your performance.

Since you're using Ovirt, I would recommend gluster-mounted volumes instead
of NFS mounts. Even with the fuse mounts currently supported, I get better
behavior, because the nodes then write to all the gluster servers at the
same time, which reduces the wait on write completions and improves
throughput over the NFS case. You'll also be ready for native libgfapi
support when Ovirt enables it, something I'm looking forward to myself.

I also got some performance improvement by setting higher numbers for
server.event-threads and client.event-threads on my volumes. This is more
setup- and load-dependent, so play around with it some.
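(For reference, the kind of commands meant - the value 4 is an assumption to
experiment with; the default for both options is 2:

  gluster volume set myvol server.event-threads 4
  gluster volume set myvol client.event-threads 4
)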

  -Darrell

> On Apr 3, 2017, at 9:33 AM, Thorsten Schade wrote:
> [...]

[Gluster-users] Gluster Performance in an Ovirt Scenario.

2017-04-03 Thread Thorsten Schade
I have a productive Ovirt cluster and am trying to understand my
performance issue.

For background, I started with Ovirt 3.6 and Gluster 3.6, and the results
have been nearly the same across versions.

What I don't understand: if an Ovirt server writes in a disperse scenario
to 4 (6) nodes, it should be near the performance of an NFS mount - but
it isn't!

All machines (Gluster and Ovirt) run CentOS 7, fully upgraded with the
newest mainline kernel.
The storage backbone is a 10 Gb network.

Gluster version 3.8.10 (6 node servers, 16 GB RAM, 4 CPUs)
Ovirt version 4.1 (3 node servers, 128 GB RAM, 8 CPUs)


Test 1:

The Gluster side: 6 computers, each with a 4 TB Red 5400 rpm data disk.
Simple single-disk performance:
write: 172 MB/s

I created a disperse volume with the supported 4 + 2 configuration and
applied "group virt".


Volume Name: vol01
Type: Disperse
Volume ID: ebb831b9-d65d-4583-98d7-f0b262cf124a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: vmw-lix-135:/data/brick1-1/brick01
Brick2: vmw-lix-136:/data/brick1-1/brick01
Brick3: vmw-lix-137:/data/brick1-1/brick01
Brick4: vmw-lix-138:/data/brick1-1/brick01
Brick5: vmw-lix-139:/data/brick1-1/brick01
Brick6: vmw-lix-134:/data/brick1-1/brick01
Options Reconfigured:
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


The Gluster volume has virtual machines running on it, with very low usage.

Performance test with dd: 10 GB read to /dev/null and write from /dev/zero,
run on the Ovirt node servers against the gluster mount.
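(A sketch of the kind of dd run described - the mount path is an assumption;
adding oflag=direct/iflag=direct would rule out the cache effect noted
below:

  dd if=/dev/zero of=/mnt/vol01/testfile bs=1M count=10240   # 10 GB write
  dd if=/mnt/vol01/testfile of=/dev/null bs=1M               # 10 GB read
)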

1 node, dd 10 GB, multiple runs:
write: 80-95 MB/s (slow)
read: 70-80 MB/s (a second read of the same dd file reaches up to
800 MB/s - cache?)

All 3 nodes running dd concurrently:
write: 80-90 MB/s (like a single-node write; slow per node, but 240 MB/s
aggregate into the gluster)
read: 40-55 MB/s (poor)

My conclusion:
a single write stream gets 80-90 MB/s, and a read is slower at only 70 MB/s.
Concurrent writes behave like single writes, but concurrent reads are poor.

Test 2:

Thinking I had a problem in my network or with the servers, I put all 6 hard
disks into one server and created 2 partitions per 4 TB disk.

Then I prepared two storages for the Ovirt cluster:
the first 6 disk partitions assembled with mdadm into a RAID 5, mounted as
an NFS data volume in Ovirt;
the other 6 disk partitions as a 4+2 disperse volume.
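(A sketch of the mdadm side as described - device names and mount point are
assumptions:

  mdadm --create /dev/md0 --level=5 --raid-devices=6 /dev/sd[b-g]1
  mkfs.xfs /dev/md0
  mount /dev/md0 /export/nfsdata   # then exported over NFS to Ovirt
)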

The disperse gluster volume gets the same performance as before:
write: 80 MB/s
read: 70 MB/s

But the NFS mount from the mdadm RAID:

single node dd:
write: 290 MB/s
read: 700 MB/s

3 nodes running dd concurrently to the NFS mount:
write: 125-140 MB/s (~400 MB/s aggregate write to the mdadm array)
read: 400-700 MB/s (~1600 MB/s aggregate from the mdadm array, near 10 Gb
network speed)

On the same server and the same disks, NFS has a real performance advantage!

The CPU was not a bottleneck during Gluster operation; I watched with htop
during operation.

Can someone explain why the Gluster volume gets nowhere near the performance
of the NFS mount on the mdadm RAID 5, in either this single-server setup or
the 6-node Gluster test?

Thanks

Thorsten

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption

2017-04-03 Thread Mahdi Adnan
Good to hear.
Eagerly waiting for the patch.

Thank you guys.




From: Krutika Dhananjay 
Sent: Monday, April 3, 2017 11:22:40 AM
To: Pranith Kumar Karampuri
Cc: Mahdi Adnan; gluster-users@gluster.org List; Gowdappa, Raghavendra
Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption

So Raghavendra has an RCA for this issue.

Copy-pasting his comment here:



Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by the current write.
2. Look for inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue mknod for the
corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.

Now, imagine a write which falls on an existing "shard_file". For the sake
of discussion let's consider a distribute of three subvols - s1, s2, s3:

1. "shard_file" hashes to subvolume s2 and is present on s2
2. add a subvolume s4 and initiate a fix layout. The layout of ".shard" is 
fixed to include s4 and hash ranges are changed.
3. write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in itable after a graph switch and 
features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and
mknod (shard_file) on s3 succeeds. But the shard_file is already present on
s2.

So we have two files on two different subvols of DHT representing the same
shard, and this will lead to corruption.




Raghavendra will be sending out a patch in DHT to fix this issue.

-Krutika


On Tue, Mar 28, 2017 at 11:49 PM, Pranith Kumar Karampuri wrote:


On Mon, Mar 27, 2017 at 11:29 PM, Mahdi Adnan wrote:

Hi,


Do you guys have any update regarding this issue ?

I do not actively work on this issue, so I do not have an accurate update,
but from what I heard from Krutika and Raghavendra (who works on DHT):
Krutika debugged initially and found that the issue seems more likely to be
in DHT; Satheesaran, who helped us recreate this issue in the lab, found
that just fix-layout without rebalance also caused the corruption 1 out of
3 times. Raghavendra came up with a possible RCA for why this can happen.
Raghavendra (CCed) would be the right person to provide an accurate update.


--

Respectfully
Mahdi A. Mahdi


From: Krutika Dhananjay
Sent: Tuesday, March 21, 2017 3:02:55 PM
To: Mahdi Adnan
Cc: Nithya Balachandran; Gowdappa, Raghavendra; Susant Palai; 
gluster-users@gluster.org List

Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption

Hi,

So it looks like Satheesaran managed to recreate this issue. We will be seeking 
his help in debugging this. It will be easier that way.

-Krutika

On Tue, Mar 21, 2017 at 1:35 PM, Mahdi Adnan wrote:

Hello, and thank you for your email.
Actually no, I didn't check the GFIDs of the VMs.
If it will help, I can set up a new test cluster and get all the data you
need.



From: Nithya Balachandran
Sent: Monday, March 20, 20:57
Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
To: Krutika Dhananjay
Cc: Mahdi Adnan, Gowdappa, Raghavendra, Susant Palai, 
gluster-users@gluster.org List

Hi,

Do you know the GFIDs of the VM images which were corrupted?

Regards,

Nithya

On 20 March 2017 at 20:37, Krutika Dhananjay wrote:

I looked at the logs.

From the time the new graph (since the add-brick command you shared, where
bricks 41 through 44 are added) is switched to (line 3011 onwards in
nfs-gfapi.log), I see the following kinds of errors:

1. Lookups to a bunch of files failed with ENOENT on both replicas, which
protocol/client converts to ESTALE. I am guessing these entries got migrated
to other subvolumes, leading to 'No such file or directory' errors.

DHT and thereafter shard get the same error code and log the following:

 0 [2017-03-17 14:04:26.353444] E [MSGID: 109040] [dht-helper.c:1198:dht_migration_complete_check_task] 17-vmware2-dht: : failed to lookup the file on vmware2-dht [Stale file handle]
 1 [2017-03-17 14:04:26.353528] E [MSGID: 133014] [shard.c:1253:shard_common_stat_cbk] 17-vmware2-shard: stat failed: a68ce411-e381-46a3-93cd-d2af6a7c3532 [Stale file handle]

which is fine.

2. The other kind is AFR logging possible split-brain, which I suppose is
harmless too.
[2017-03-17 14:23:36.968883] W [MSGID: 108008] 
[afr-read-txn.c:228:afr_read_txn] 17-vmware2-replicate-13: Unreadable subvolume 
-1 found with event generation 2 for gfid 

[Gluster-users] Performance testing

2017-04-03 Thread Krist van Besien
Hi All,

I built a Gluster 3.8.4 (RHGS 3.2) cluster for a customer, and I am having
some trouble demonstrating that it performs well.

The customer compares it with his old NFS-based NAS, and runs fio to test
workloads.

What I notice is that fio throughput is only ~20 Mb/s, which is not a lot.
When I do a simple test with dd I easily get 600 Mb/s throughput.
In the fio job file the option "direct=1" is used, which bypasses caching.
If we run a fio job with direct=0, the performance goes up a lot and is
near 600 Mb/s as well.
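(For reference, a minimal fio job of the kind described - the mount path,
block size, and file size are assumptions:

  [global]
  ioengine=libaio
  direct=1
  bs=128k
  rw=write
  size=10g

  [gluster-test]
  directory=/mnt/glustervol

Flipping direct=1 to direct=0 in [global] gives the cached case.)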

The customer insists that on his old system (which Gluster should replace)
he could get 600 Mb/s throughput with fio with direct=1, and that he was
rather underwhelmed by Gluster's performance here.

What I need is an answer to either:
- Have I overlooked something? I have not really done much tuning yet. Is
there some obvious parameter I overlooked that could change the results of
a fio performance test?

or:

- Is testing with "direct=1" not really a fair way to test Gluster, because
the cache is a rather important part of what makes Gluster perform?

-- 
Vriendelijke Groet |  Best Regards | Freundliche Grüße | Cordialement
--
Krist van Besien | Senior Architect | Red Hat EMEA Cloud Practice | RHCE |
RHCSA Open Stack
@: kr...@redhat.com | M: +41-79-5936260
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] adding arbiter

2017-04-03 Thread Alessandro Briosi
On 01/04/2017 04:22, Gambit15 wrote:
> As I understand it, only new files will be sharded, but simply
> renaming or moving them may be enough in that case.
>
> I'm interested in the arbiter/sharding bug you've mentioned. Could you
> provide any more details or a link?
>

I think it is triggered only on rebalance.

I still have no idea, though, whether adding an arbiter afterwards needs a
rebalance or not; and since this should only write file references (and no
data) to the arbiter, it should not touch anything on the data side. I
wanted to be sure, though, before doing this on a production environment.
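(For reference, the usual command shape for adding an arbiter to an existing
replica 2 volume - volume and host names are assumptions:

  gluster volume add-brick myvol replica 3 arbiter 1 arb-host:/bricks/arbiter/myvol

The arbiter brick stores only file names and metadata, which matches the
"file references, no data" expectation above.)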

The bug has been discussed on the mailing list. There are a couple of
patches that went into 3.8.10:

https://review.gluster.org/#/c/16749/
https://review.gluster.org/#/c/16750/

though I'm not sure whether they solved the problem.

https://bugzilla.redhat.com/show_bug.cgi?id=1387878

If you look at the mailing list archive you can find more information on
this.

Currently I'm not using sharding, though since I'm using Gluster to host
VMs, if one of the hosts had a problem, healing would require a lot of CPU
and time to recover the files.
Sharding should solve this, but I'd rather wait out the time it takes to
heal than have to go through a restore from backup because there was data
corruption.

Any hint would really be appreciated.

Alessandro
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption

2017-04-03 Thread Gandalf Corvotempesta
This is good news.
Is this related to the previously fixed bug?

On 3 Apr 2017 10:22 AM, "Krutika Dhananjay" wrote:

> So Raghavendra has an RCA for this issue.
> [...]

Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption

2017-04-03 Thread Krutika Dhananjay
So Raghavendra has an RCA for this issue.

Copy-pasting his comment here:



Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by the current write.
2. Look for inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue mknod for the
corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.

Now, imagine a write which falls on an existing "shard_file". For the
sake of discussion let's consider a distribute of three subvols - s1,
s2, s3:

1. "shard_file" hashes to subvolume s2 and is present on s2
2. add a subvolume s4 and initiate a fix layout. The layout of
".shard" is fixed to include s4 and hash ranges are changed.
3. write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in itable after a graph
switch and features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and
mknod (shard_file) on s3 succeeds. But the shard_file is already
present on s2.

So we have two files on two different subvols of DHT representing
the same shard, and this will lead to corruption.



Raghavendra will be sending out a patch in DHT to fix this issue.

-Krutika


On Tue, Mar 28, 2017 at 11:49 PM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
> [...]