Re: [Gluster-users] GlusterFS Storage Interruption at Node Loss
Nic, I believe this is normal, expected behaviour. The network timeout exists
because tearing down the sockets etc. is expensive, so you only want to do it
if a node has really failed, not for some transitory network blip.

On 8 July 2016 at 20:29, Nic Seltzer wrote:
> Hello list!
>
> I am experiencing an issue whereby mounted Gluster volumes are made
> read-only until the network timeout interval has passed or the node comes
> back online. I have reduced the network timeout to one second and was able
> to reduce the size of the outage window to two seconds. I am curious
> whether anyone else has seen this issue and how they went about resolving
> it for their implementation. We are using a distributed-replicated volume,
> but have also tested a _just_ replicated volume with the same results. I
> can provide the gluster volume info if it's helpful, but suffice to say it
> is a pretty simple setup.
>
> Thanks!
>
> --
> Nic Seltzer
> Esports Ops Tech | Riot Games

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
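The timeout being traded off here is gluster's network.ping-timeout volume option. A command sketch of inspecting and adjusting it (the volume name "myvol" is a placeholder; `volume get` is available on recent releases):

```shell
# 'myvol' is a placeholder volume name. The timeout in question is
# gluster's network.ping-timeout (default 42 seconds).
gluster volume get myvol network.ping-timeout

# Very low values (e.g. 1s) shrink the outage window but risk spurious
# disconnect/reconnect churn on brief blips; a middle value is usually safer.
gluster volume set myvol network.ping-timeout 10
```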
Re: [Gluster-users] New cluster - first experience
On 9/07/2016 6:07 AM, Gandalf Corvotempesta wrote:
> With balance-rr you should reach 2000 Mbit, as the client is writing to 3
> servers simultaneously and is therefore using different
> destinations/connections that are balanced. You won't see the aggregated
> speed when communicating directly between 2 hosts, but in this case there
> are 4 hosts involved (1 "client", 3 servers) and thus 4 IPs.

Nope, communication with each server will be limited to 1Gbps max. It's just
that you will be able to write to each server *simultaneously* at 1Gbps.

What's your network topology? Are the servers connected via a switch? What
brand/model is the switch? Does it have bonding set up? Also, what's the
underlying disk setup - models, RAID etc.

--
Lindsay Mathieson
Re: [Gluster-users] self healing with sharding
On 8/07/2016 9:40 PM, Gandalf Corvotempesta wrote:
> How did you measure the performance? I would like to test in the same way,
> so that the results are comparable.

Not particularly scientific. I have four main tests I run:

1. CrystalDiskMark in a Windows VM. This lets me see IOPS as experienced by
   the VM. I'm suspicious of standard disk benchmarks though; they don't
   really reflect day-to-day usage.

2. The build server for our enterprise product - a fairly large command-line
   build, a real-world usage that exercises random reads/writes fairly well.

3. Starting up and running standard applications - Eclipse, Office 365,
   Outlook etc. More subjective, but that does matter.

4. Multiple simultaneous VM starts - a good stress test.

> Which network/hardware/servers topology are you using?

3 compute servers - combined VM hosts and gluster nodes, for a replica 3
gluster volume.

VNA:
- Dual Xeon E5-2660 2.2GHz
- 64GB ECC RAM
- 2 x 1Gb bond
- 4 x 3TB WD Red in ZFS RAID10

VNB, VNG:
- Xeon E5-2620 2.0GHz
- 64GB RAM
- 3 x 1Gb bond
- 4 x 3TB WD Red in ZFS RAID10

All bonds are LACP balance-tcp with a dedicated switch. VNA is supposed to
have 3 x 1Gb as well, but we had driver problems with the 3rd card and I
haven't got round to fixing it :( Internal & external traffic share the
bond; external traffic is minimal.

--
Lindsay Mathieson
[Gluster-users] GlusterFS Storage Interruption at Node Loss
Hello list!

I am experiencing an issue whereby mounted Gluster volumes are made
read-only until the network timeout interval has passed or the node comes
back online. I have reduced the network timeout to one second and was able
to reduce the size of the outage window to two seconds. I am curious whether
anyone else has seen this issue and how they went about resolving it for
their implementation. We are using a distributed-replicated volume, but have
also tested a _just_ replicated volume with the same results. I can provide
the gluster volume info if it's helpful, but suffice to say it is a pretty
simple setup.

Thanks!

--
Nic Seltzer
Esports Ops Tech | Riot Games
Cell: +1.402.431.2642 | NA Summoner: Riot Dankeboop
http://www.riotgames.com
http://www.leagueoflegends.com
Re: [Gluster-users] New cluster - first experience
2016-07-08 21:53 GMT+02:00 Alastair Neil:
> Also remember that with a single transfer you will not see 2000 Mb/s, only
> 1000 Mb/s.

With balance-rr you should reach 2000 Mbit, as the client is writing to 3
servers simultaneously and is therefore using different
destinations/connections that are balanced. You won't see the aggregated
speed when communicating directly between 2 hosts, but in this case there
are 4 hosts involved (1 "client", 3 servers) and thus 4 IPs.
Re: [Gluster-users] New cluster - first experience
Also remember that with a single transfer you will not see 2000 Mb/s, only
1000 Mb/s.

On 8 July 2016 at 15:14, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:
> 2016-07-08 20:43 GMT+02:00:
>> Gluster, and in particular the fuse mounter, do not operate on small
>> file workloads anywhere near wire speed in their current arch.
>
> I know that I'll be unable to reach wire speed, but with 2000 Mbit
> available, reaching only 88 Mbit with a 1GB file is really low.
Re: [Gluster-users] New cluster - first experience
2016-07-08 20:43 GMT+02:00:
> Gluster, and in particular the fuse mounter, do not operate on small file
> workloads anywhere near wire speed in their current arch.

I know that I'll be unable to reach wire speed, but with 2000 Mbit
available, reaching only 88 Mbit with a 1GB file is really low.
Re: [Gluster-users] New cluster - first experience
2016-07-08 20:35 GMT+02:00 Joe Julian:
> Assuming this dd was run from one of the servers to a replica 3 volume,
> you have one localhost write and two network writes for 88 Mbit/s, which
> looks like the maxing out of a 100Mbit connection. That is so coincidental
> it would lead me to look at the network.

Exactly, but with iperf I'm able to reach about 1.70 Gbit on a dual gigabit
bond.
Re: [Gluster-users] New cluster - first experience
Assuming this dd was run from one of the servers to a replica 3 volume, you
have one localhost write and two network writes for 88 Mbit/s, which looks
like the maxing out of a 100Mbit connection. That is so coincidental it
would lead me to look at the network.

On 07/08/2016 11:23 AM, Gandalf Corvotempesta wrote:
> 2016-07-08 10:55 GMT+02:00 Gandalf Corvotempesta:
>> Now I'm using a bonded gigabit (2x1Gb) on every server but I'm still
>> stuck at about 15-30 Mbit/s when extracting the Linux kernel. Total
>> extraction still takes about 10 minutes.
>> Something strange is going on; on a dual gigabit connection (2000 Mbit)
>> I'm expecting to reach hundreds of Mbit (200-300), not tens (15-20).
>>
>> # dd if=/dev/zero of=/mnt/glusterfs/zero4 bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 88.2981 s, 11.9 MB/s
>
> No suggestions?
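Joe's 100Mbit observation can be checked with arithmetic on the dd output alone (no gluster needed):

```shell
# Convert dd's reported 1048576000 bytes in 88.2981 s into line rate.
awk 'BEGIN { bytes = 1048576000; secs = 88.2981
             printf "%.0f Mbit/s\n", bytes * 8 / secs / 1e6 }'
# → 95 Mbit/s
```

That is suspiciously close to a saturated 100 Mbit link even though the bond reports 1000 Mbps per slave, which supports checking negotiated link speed end to end (NICs and switch ports alike).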
Re: [Gluster-users] New cluster - first experience
2016-07-08 10:55 GMT+02:00 Gandalf Corvotempesta:
> Now I'm using a bonded gigabit (2x1Gb) on every server but I'm still
> stuck at about 15-30 Mbit/s when extracting the Linux kernel. Total
> extraction still takes about 10 minutes.
> Something strange is going on; on a dual gigabit connection (2000 Mbit)
> I'm expecting to reach hundreds of Mbit (200-300), not tens (15-20).
>
> # dd if=/dev/zero of=/mnt/glusterfs/zero4 bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 88.2981 s, 11.9 MB/s

No suggestions?
Re: [Gluster-users] [Gluster-devel] One client can effectively hang entire gluster array
> In either of these situations, one glusterfsd process on whatever peer the
> client is currently talking to will skyrocket to *nproc* cpu usage (800%,
> 1600%) and the storage cluster is essentially useless; all other clients
> will eventually try to read or write data to the overloaded peer and, when
> that happens, their connection will hang. Heals between peers hang because
> the load on the peer is around 1.5x the number of cores or more. This
> occurs in either gluster 3.6 or 3.7, is very repeatable, and happens much
> too frequently.

I have some good news and some bad news. The good news is that features to
address this are already planned for the 4.0 release. Primarily I'm
referring to QoS enhancements, some parts of which were already implemented
for the bitrot daemon. I'm still working out the exact requirements for this
as a general facility, though. You can help! :) Also, some of the work on
"brick multiplexing" (multiple bricks within one glusterfsd process) should
help to prevent the thrashing that causes a complete freeze-up.

Now for the bad news. Did I mention that these are 4.0 features? 4.0 is not
near term, and not getting any nearer as other features and releases keep
"jumping the queue" to absorb all of the resources we need for 4.0 to
happen. Not that I'm bitter or anything. ;)

To address your more immediate concerns, I think we need to consider more
modest changes that can be completed in more modest time. For example:

* The load should *never* get to 1.5x the number of cores. Perhaps we could
  tweak the thread-scaling code in io-threads and epoll to check system load
  and not scale up (or even scale down) if system load is already high.

* We might be able to tweak io-threads (which already runs on the bricks and
  already has a global queue) to schedule requests in a fairer way across
  clients. Right now it executes them in the same order that they were read
  from the network. That tends to be a bit "unfair"; it should really be
  fixed in the network code, but that's a much harder task.

These are only weak approximations of what we really should be doing, and
will be doing in the long term, but (without making any promises) they might
be sufficient and achievable in the near term. Thoughts?
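The fairness idea in the second bullet - round-robin across per-client queues instead of strict arrival order - can be illustrated with a toy sketch (this is an analogy, not the actual io-threads code; client names and requests are invented):

```shell
# Client A floods the queue with three requests before client B's single one.
qa=$(mktemp); qb=$(mktemp)
printf 'a-req-0\na-req-1\na-req-2\n' > "$qa"
printf 'b-req-0\n' > "$qb"

# Strict FIFO (arrival order) would serve B last. Round-robin across the
# per-client queues interleaves them, so B is served second instead:
paste -d '\n' "$qa" "$qb" | sed '/^$/d'
rm -f "$qa" "$qb"
```

The output serves b-req-0 second rather than fourth, which is exactly the property that would stop one flooding client from starving everyone else.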
[Gluster-users] One client can effectively hang entire gluster array
Hello, users and devs.

TL;DR: One gluster client can essentially cause denial of service /
availability loss to an entire gluster array. There's no way to stop it and
almost no way to find the bad client. Probably all (at least 3.6 and 3.7)
versions are affected.

We have two large replicate gluster arrays (3.6.6 and 3.7.11) that are used
in a high-performance computing environment. Two file-access cases cause
severe issues with glusterfs: some of our scientific codes write hundreds of
files (~400-500) simultaneously (one file or more per processor core, so
lots of small or large writes), and others read thousands of files
(2000-3000) simultaneously to grab metadata from each file (lots of small
reads).

In either of these situations, one glusterfsd process on whatever peer the
client is currently talking to will skyrocket to *nproc* cpu usage (800%,
1600%) and the storage cluster is essentially useless; all other clients
will eventually try to read or write data to the overloaded peer and, when
that happens, their connection will hang. Heals between peers hang because
the load on the peer is around 1.5x the number of cores or more. This occurs
in either gluster 3.6 or 3.7, is very repeatable, and happens much too
frequently.

Even worse, there seems to be no definitive way to diagnose which client is
causing the issues. Getting 'volume status <> clients' doesn't help because
it reports the total number of bytes read/written by each client: (a) the
metadata in question is tiny compared to the multi-gigabyte output files
being dealt with, and (b) the byte count is cumulative and the compute nodes
are always up with the filesystems mounted, so the byte transfer counts are
astronomical.

The best solution I've come up with is to blackhole-route traffic from
clients one at a time (effectively pushing the traffic over to the other
peer), wait a few minutes for all of the backlogged traffic to dissipate (if
it's going to), see if the load on glusterfsd drops, and repeat until I find
the client causing the issue. I would *love* any ideas on a better way to
find rogue clients.

More importantly, though, there must be some mechanism to stop one user from
being able to render the entire filesystem unavailable for all other users.
In the worst case, I would even prefer a gluster volume option that simply
disconnects clients making over some threshold of file-open requests. That's
WAY more preferable than a complete availability loss reminiscent of a DDoS
attack...

Apologies for the essay, and looking forward to any help you can provide.

Thanks,
Patrick
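The blackhole-routing hunt described above can be scripted as a dry run - the loop below only echoes the commands, so they can be applied and reverted one client at a time while watching glusterfsd load. The client IPs are hypothetical; in practice they would come from the peer list or 'gluster volume status <vol> clients':

```shell
# Hypothetical suspect clients - substitute real client addresses.
for client in 10.0.0.21 10.0.0.22; do
    echo "ip route add blackhole ${client}/32   # isolate ${client}, watch glusterfsd load"
    echo "ip route del blackhole ${client}/32   # restore once checked"
done
```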
Re: [Gluster-users] replace brick in distributed-dispersed setup
Hi Ashish,

It was an error on my side, nothing gluster related. The kernel version I
was running had a bug that prevented the fuse module from loading, which
caused the brick replacement errors. After upgrading I can confirm that the
process of replacing the brick works fine, both with the brick to be
replaced online and after killing the brick process.

I have a question though. On a real (non-virtual) server set up in JBOD
mode, if a drive fails, does gluster kill the brick pid?

Regards,
Iñaki.

On 07/08/2016 07:03 AM, Ashish Pandey wrote:
> Hi Iñaki,
>
> The steps you are following don't have any issue. I would like to have
> more information to debug this further:
>
> 1 - gluster v info
> 2 - gluster v status, before and after running replace-brick
> 3 - Brick logs (for this volume only) from /var/log/glusterfs/bricks/
> 4 - glusterd logs from /var/log/glusterfs/, starting with
>     "usr-local-etc-glusterfs-glusterd-"
>
> Although it should not matter, could you also try to replace a brick
> without killing that brick process?
>
> Ashish
>
> From: "itlinux_team"
> To: gluster-users@gluster.org
> Sent: Wednesday, July 6, 2016 4:33:54 PM
> Subject: [Gluster-users] replace brick in distributed-dispersed setup
>
> Hi all,
>
> I'm doing some testing with glusterfs in a virtualized environment,
> running a 3 x (8 + 4) distributed-dispersed volume simulating a 3-node
> cluster with 12 drives per node.
>
> The system versions are:
> OS: Debian jessie, kernel 3.16
> Gluster: 3.8.0-2, installed from the gluster.org Debian repository
>
> I have tested the node-failure scenario while some clients run read/write
> operations, and the setup works as expected. Now I'm trying to test how to
> replace a faulty drive on this setup; however, I'm not able to replace a
> brick. To test it I have:
>
> 1: Found the pid of the brick I'd like to 'fail' and killed the process
>    (I tried removing the drive from the host, but that made the whole
>    guest unresponsive).
> 2: Attached a new virtual drive, formatted and mounted it.
> 3: Tried the gluster volume replace-brick command.
>
> And I'm getting the following error:
>
> gluster volume replace-brick vol_1 glusterserver1:/ext/bricks/brick-1
> glusterserver1:/ext/bricks/brick-13 commit force
> volume replace-brick: failed: Fuse unavailable
> Replace-brick failed
>
> I assume I'm doing something wrong but don't know what exactly. Looking in
> the documentation I have not found information about brick replacement in
> distributed-dispersed setups.
>
> Thanks!
> Iñaki
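For reference, the working replacement sequence described in this thread can be sketched as follows, using the volume and host names from the thread (vol_1, glusterserver1) and a hypothetical new device /dev/sdX1:

```shell
# 1. Prepare the replacement filesystem (device name is hypothetical).
mkfs.xfs /dev/sdX1
mount /dev/sdX1 /ext/bricks/brick-13

# 2. Swap the dead brick for the new one in a single step; the disperse
#    heal then repopulates the new brick in the background.
gluster volume replace-brick vol_1 \
    glusterserver1:/ext/bricks/brick-1 \
    glusterserver1:/ext/bricks/brick-13 commit force

# 3. Watch the heal progress.
gluster volume heal vol_1 info
```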
[Gluster-users] Deleting data on glusterfs replicate volume take more time.
Hi All,

I have a replicated gluster volume of 400GB; it has two bricks from two
servers, each with an XFS filesystem of 400GB. I want to delete 200GB of
data in the gluster volume. I started deleting the data, but it's taking a
long time, even to delete 1GB of data.

Can you please guide me - is there any volume tuning option I need to set?

Thanks,
Veera.
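One thing worth trying before tuning options: on a replica volume each unlink is a synchronous round-trip to both bricks, so a single serial rm -rf is latency-bound, and running several deletions in parallel usually helps. A sketch, demonstrated here on a throwaway local directory (point 'target' at a directory under your gluster mount, e.g. a hypothetical /mnt/gluster/olddata, for real use):

```shell
# Demonstration on a temporary directory; substitute your gluster path.
target=$(mktemp -d)
touch "${target}"/file{1..100}

# Delete in parallel batches instead of one serial rm -rf: each unlink on a
# replica volume waits on both bricks, so parallelism hides the latency.
find "${target}" -type f -print0 | xargs -0 -P 4 -n 25 rm -f

find "${target}" -type f | wc -l   # → 0
rmdir "${target}"
```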
Re: [Gluster-users] self healing with sharding
2016-07-08 13:19 GMT+02:00 Lindsay Mathieson:
> Depends on what performance measures you use. In my ad hoc testing, my
> impression was that large sequential reads/writes slowed down as the shard
> size shrank, but oddly, random I/O actually increased.

How did you measure the performance? I would like to test in the same way,
so that the results are comparable.

Which network/hardware/servers topology are you using?
Re: [Gluster-users] self healing with sharding
2016-07-08 11:23 GMT+02:00 Kevin Lemonnier:
> No, only the shards that were modified during the downtime of the node
> will need to be healed. It is MUCH quicker than healing the whole VM file
> without sharding, and shouldn't provoke a freeze of the VM because of
> locking.

This is not clear to me, due to my limited understanding of gluster.

Let's assume a 100GB virtual machine image (qcow2 or whatever) with 64MB
shards. One user runs "touch /tmp/test" inside this virtual machine during
the node downtime. Gluster will "update" only the involved shard, right?
Thus, when the node comes back, only that single shard must be healed -
maybe one 64MB shard out of a 100GB image. Without sharding, a single
"touch" would require the whole 100GB to be healed?

What happens to the virtual machine during the shard healing? Are only
files included in that shard set read-only? Or are all files included in
that shard hidden? What if a file is spread across multiple shards and one
of those shards needs healing?
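The mapping from a write to a shard is simple arithmetic: byte offset N of the image lands in shard index N / shard-size (shard 0 being the base file; on the bricks, the remaining pieces live under the hidden .shard directory, named by gfid and index). A sketch with the 64MB shard size from the example - pure arithmetic, no gluster involved:

```shell
shard_mb=64
offset_mb=200                       # guest write landing 200MiB into the image
echo $(( offset_mb / shard_mb ))    # → 3: only shard index 3 is dirtied
```

So regardless of the 100GB image size, that touch dirties (at most a few) 64MB shards, and only those need healing.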
Re: [Gluster-users] self healing with sharding
No, only the shards that were modified during the downtime of the node will
need to be healed. It is MUCH quicker than healing the whole VM file without
sharding, and shouldn't provoke a freeze of the VM because of locking.

On Fri, Jul 08, 2016 at 10:41:16AM +0200, Gandalf Corvotempesta wrote:
> Let's assume a 3-node cluster with replica 3 and a huge file (1GB) with a
> shard size of 100MB.
>
> Gluster automatically creates 10 chunks for the file.
>
> In a 3-node cluster with replica 3, all chunks are on every server.
>
> A node dies.
>
> When the node comes back online, self healing is triggered. In this case,
> the whole file would be healed, as all shards must be replicated, right?
> Any advantage with this over a standard configuration without sharding?

--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

signature.asc
Description: Digital signature
[Gluster-users] lingering <gfid:*> entries in volume heal, gluster 3.6.3
Hi,

One of our bricks was offline for a few days when it didn't reboot after a
yum update (the gluster version wasn't changed).

The volume heal info is showing the same 129 entries, all of the format
<gfid:*>, on the 3 bricks that remained up, and no entries on the brick that
was offline.

glustershd.log on the brick that was offline has stuff like this in it:

[2016-07-08 08:54:07.411486] I [client-handshake.c:1200:client_setvolume_cbk] 0-gv0-client-1: Connected to gv0-client-1, attached to remote volume '/data/brick/gv0'.
[2016-07-08 08:54:07.411493] I [client-handshake.c:1210:client_setvolume_cbk] 0-gv0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-07-08 08:54:07.411678] I [client-handshake.c:188:client_set_lk_version_cbk] 0-gv0-client-1: Server lk version = 1
[2016-07-08 08:54:07.793661] I [client-handshake.c:1200:client_setvolume_cbk] 0-gv0-client-3: Connected to gv0-client-3, attached to remote volume '/data/brick/gv0'.
[2016-07-08 08:54:07.793688] I [client-handshake.c:1210:client_setvolume_cbk] 0-gv0-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2016-07-08 08:54:07.794091] I [client-handshake.c:188:client_set_lk_version_cbk] 0-gv0-client-3: Server lk version = 1

but glustershd.log on the other 3 bricks has many lines looking like this:

[2016-07-08 09:05:17.203017] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-3: remote operation failed: No such file or directory. Path: (81dc9194-2379-40b5-a949-f7550433b2e0)
[2016-07-08 09:05:17.203405] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-0: remote operation failed: No such file or directory. Path: (b1e273ad-9eb1-4f97-a41c-39eecb149bd6)
[2016-07-08 09:05:17.204035] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-0: remote operation failed: No such file or directory. Path: (436dcbec-a12a-4df9-b8ef-bae977c98537)
[2016-07-08 09:05:17.204225] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed: No such file or directory. Path: (436dcbec-a12a-4df9-b8ef-bae977c98537)
[2016-07-08 09:05:17.204651] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-0: remote operation failed: No such file or directory. Path: (08713e43-7bcb-43f3-818a-7b062abd6e95)
[2016-07-08 09:05:17.204879] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed: No such file or directory. Path: (08713e43-7bcb-43f3-818a-7b062abd6e95)
[2016-07-08 09:05:17.205042] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-gv0-client-3: remote operation failed: No such file or directory. Path: (08713e43-7bcb-43f3-818a-7b062abd6e95)

How do I fix this? I need to update the other bricks but am reluctant to do
so until the volume is in good shape first.

We're running Gluster 3.6.3 on CentOS 7. Volume info:

Volume Name: callrec
Type: Replicate
Volume ID: a39830b7-eddb-4061-b381-39411274131a
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: gluster1a-1:/data/brick/callrec
Brick2: gluster1b-1:/data/brick/callrec
Brick3: gluster2a-1:/data/brick/callrec
Brick4: gluster2b-1:/data/brick/callrec
Options Reconfigured:
performance.flush-behind: off

--
Cheers,
Kingsley.
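A <gfid:...> heal entry can usually be mapped back to a real file on a brick: each gfid is stored as a hard link under the brick's hidden .glusterfs directory, bucketed by the first two byte pairs of the gfid. A sketch using a gfid from the log above and the brick path from the log (/data/brick/gv0):

```shell
gfid="81dc9194-2379-40b5-a949-f7550433b2e0"
brick="/data/brick/gv0"

# gfid aabbcccc-... lives at <brick>/.glusterfs/aa/bb/<full-gfid>
p1=$(printf %s "$gfid" | cut -c1-2)
p2=$(printf %s "$gfid" | cut -c3-4)
link="${brick}/.glusterfs/${p1}/${p2}/${gfid}"
echo "${link}"
# → /data/brick/gv0/.glusterfs/81/dc/81dc9194-2379-40b5-a949-f7550433b2e0

# For a regular file this is a hard link; recover the human-readable name
# by inode (run on the brick server):
# find "${brick}" -samefile "${link}" -not -path '*/.glusterfs/*'
```

If the .glusterfs entry is missing or dangling on some bricks (consistent with the "No such file or directory" lookups above), that explains entries that never clear; triggering a full heal is one commonly suggested next step, though on 3.6 results vary.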
Re: [Gluster-users] New cluster - first experience
2016-07-08 10:55 GMT+02:00 Gandalf Corvotempesta:
> # gluster volume info gv0
>
> Volume Name: gv0
> Type: Replicate
> Volume ID: 2a36dc0f-1d9b-469c-82de-9d8d98321b83
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 77.95.175.112:/export/sdb1/brick
> Brick2: 77.95.175.113:/export/sdb1/brick
> Brick3: 77.95.175.114:/export/sdb1/brick
> Options Reconfigured:
> performance.cache-size: 2GB
> performance.write-behind-window-size: 2GB
> features.shard-block-size: 64MB
> features.shard: on
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on

Bond configuration on each server:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:c8:a0:6c
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:c8:a0:6d
Slave queue ID: 0
[Gluster-users] self healing with sharding
Let's assume a 3-node cluster with replica 3 and a huge file (1GB) with a
shard size of 100MB.

Gluster automatically creates 10 chunks for the file.

In a 3-node cluster with replica 3, all chunks are on every server.

A node dies.

When the node comes back online, self healing is triggered. In this case,
the whole file would be healed, as all shards must be replicated, right?
Any advantage with this over a standard configuration without sharding?