Thanks for the info, Joe! I responded inline.

On Mon, Aug 31, 2015 at 3:00 PM, Joe Julian <[email protected]> wrote:

> On 08/31/2015 12:35 PM, Grant Ridder wrote:
>
>> Hi,
>>
>> I am testing out several failure scenarios with GlusterFS. I have a
>> 3-node replicated Gluster cluster that I am testing with.
>>
>> One test I am having trouble solving is when a host dies (i.e., as if
>> someone pulled the power cord out):
>> - Firewall off a host from the rest of the cluster
>> - Measure the time it takes for a FUSE mount to respond once the
>>   iptables rule is added
>>
>> With the default settings, the mount hangs for 39 seconds. If I change
>> ping-timeout to 5, the mount only hangs for 9.3 seconds. Is there any
>> way to eliminate the hang or reduce it to a negligible value (less than
>> 1 second)?
>>
>> I have not seen much about handling GlusterFS failure scenarios in my
>> Googling around.
>>
>> Several blog posts I have looked at:
>>
>> http://thornelabs.net/2015/02/24/change-gluster-volume-connection-timeout-for-glusterfs-native-client.html
>>
>> https://joejulian.name/blog/keeping-your-vms-from-going-read-only-when-encountering-a-ping-timeout-in-glusterfs/
>>
>> Thanks,
>> Grant
>
> If a client disconnects from a server, you have to reestablish all the
> file descriptors and synchronize the locks when the client reconnects. This
> can be pretty expensive and there's no way to avoid it. To balance that,
> you don't want your clients to disconnect from the servers if a packet is
> lost or takes too long to get a response. That's why the connections are
> TCP, to help mitigate that, and why the client waits for some ping-timeout.
> If it was too short, even server load could trigger a disconnection, which
> would be followed by high server load as the connection was reestablished,
> potentially causing a disconnection again.
>
> Pulled power cords or complete system failures should be a very rare
> occurrence.

This unfortunately isn't the case with AWS EC2 instances.

> Typically this is much rarer than a temporary network issue, which is much
> more likely to be mitigated in the network fabric and is transient enough
> to allow the ping-timeout to hold the connection long enough to avoid the
> reestablishment of FDs and locks. It's also only an issue if your file hash
> hits that specific replica out of the dht set (or the file doesn't exist).
> If you're using a cluster where server failure is frequent enough to be an
> issue, your dht distribution lowers the likelihood of the file being hit to
> an insignificant statistic.
>
> If you're working with reasonably resilient hardware, you should easily be
> able to engineer for 5 or 6 nines even with a 42 second ping-timeout.

What settings do people normally use for EC2? I have to account for
instances dying without the clients being significantly impacted. 42 seconds
is a VERY long time for a data store to be unavailable.

> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://www.gluster.org/mailman/listinfo/gluster-users
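
To put the "5 or 6 nines" figure in the thread next to the hang times mentioned, here is a rough back-of-the-envelope sketch. It assumes a 365-day year and that each server failure costs clients exactly one hang of the given length, which is a simplification. The function names are illustrative only and are not part of GlusterFS.

```python
# Rough availability arithmetic; all names here are hypothetical helpers.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability.
    """
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


def hangs_per_year(nines: int, hang_seconds: float) -> float:
    """How many hangs of `hang_seconds` fit in the yearly downtime budget.

    Assumes every failure costs exactly one full hang and nothing else.
    """
    return downtime_budget_seconds(nines) / hang_seconds


if __name__ == "__main__":
    # 42 s is the ping-timeout quoted in the thread; 9.3 s is the hang
    # measured in the thread with ping-timeout set to 5.
    for nines in (5, 6):
        for hang in (42.0, 9.3):
            budget = downtime_budget_seconds(nines)
            count = hangs_per_year(nines, hang)
            print(f"{nines} nines: {budget:.2f} s/year budget, "
                  f"{count:.2f} hangs of {hang} s")
```

Under these assumptions, five nines leaves room for roughly seven 42-second hangs per year, while six nines leaves room for less than one.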
