If you are not using the GlusterFS NFS service, have you tried Linux tc (traffic control) or traffic isolation?

Putting traffic control on inter-node traffic will limit the rebalance/self-heal IO. Alternatively, you can move inter-node traffic to its own network interface, using routing and /etc/hosts entries.
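
As a rough sketch of combining the two options, assuming inter-node traffic has already been moved to its own interface: the helper below only builds a tc token bucket filter (tbf) command and does not run it. The interface name eth1, the 400mbit rate, and the burst/latency defaults are hypothetical values, not recommendations. A root tbf qdisc shapes egress only, so each node would need its own rule.

```python
def tbf_command(iface, rate, burst="256kb", latency="50ms"):
    """Build (but do not run) a tc command that caps egress on iface.

    iface, rate, burst and latency are illustrative assumptions; pick
    values that match your own network before using the result.
    """
    return ["tc", "qdisc", "add", "dev", iface, "root",
            "tbf", "rate", rate, "burst", burst, "latency", latency]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run as root.
    print(" ".join(tbf_command("eth1", "400mbit")))
```

This is a static cap on everything leaving that interface, which is only reasonable if nothing but inter-node traffic uses it.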

I would expect that one of the issues in controlling the rebalance/self-heal IO at the glusterfsd level is hooking into the kernel for traffic information on the interface it is routing through. Since both activities are push based, the receiver needs to know its full IO picture via the kernel and push back accordingly.

You don't want a static setting: it would throttle rebalance/self-heal more than necessary at idle times and would not back off enough under high load. The limit needs to be dynamic, based on "available" IO.
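
To make "dynamic" concrete, here is a minimal sketch of the calculation, not anything that exists in glusterfsd. The function name, the headroom fraction, and the floor are all assumptions for illustration; the client load figure would have to come from the kernel's interface or disk statistics.

```python
def heal_rate_limit(link_capacity_mbit, client_load_mbit,
                    floor_mbit=10.0, headroom=0.1):
    """Return a bandwidth cap (Mbit/s) for rebalance/self-heal traffic.

    Hypothetical helper: all names and defaults are assumptions.
    """
    if link_capacity_mbit <= 0:
        raise ValueError("link_capacity_mbit must be positive")
    # Keep some headroom so client traffic can grow before the next sample.
    usable = link_capacity_mbit * (1.0 - headroom)
    # Whatever clients are not using is "available" for background work.
    available = usable - max(client_load_mbit, 0.0)
    # Never drop below the floor, or heal would stall under sustained load.
    return max(available, floor_mbit)


if __name__ == "__main__":
    # Idle, half-loaded, and nearly saturated 1 Gbit link.
    for load in (0, 500, 950):
        print(load, heal_rate_limit(1000, load))
```

The cap shrinks as client load rises and falls back to the floor when the link is nearly saturated. The hard part remains getting a trustworthy load figure on the receiving side.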

This is definitely a "not easy" problem to solve.

On 05/16/13 02:54, Hans Lambermont wrote:
Hi all,

My production setup also suffers from total unavailability outages when
self-heal gets real work to do. On a 4 server distributed-replicate 14x2
cluster where 1 server has been down for 2 days the volume becomes
completely unresponsive when we bring the server back into the cluster.

I ticketed it here : https://bugzilla.redhat.com/show_bug.cgi?id=963223
"Re-inserting a server in a v3.3.2qa2 distributed-replicate volume DOSes
the volume"

Does anyone know of a way to slow down self-heal so that it does not
make the volume unresponsive ?


The "unavailability due to high load caused by gluster itself" pattern
repeats itself in several cases :

https://bugzilla.redhat.com/show_bug.cgi?id=950024 replace-brick
immediately saturates IO on source brick causing the entire volume to be
unavailable, then dies

https://bugzilla.redhat.com/show_bug.cgi?id=950006 replace-brick
activity dies, destination glusterfs spins at 100% CPU forever

https://bugzilla.redhat.com/show_bug.cgi?id=832609 Glusterfsd hangs if
brick filesystem becomes unresponsive, causing all clients to lock up

https://bugzilla.redhat.com/show_bug.cgi?id=962875 Entire volume DOSes
itself when a node reboots and runs fsck on its bricks while network is up

https://bugzilla.redhat.com/show_bug.cgi?id=963223 Re-inserting a server
in a v3.3.2qa2 distributed-replicate volume DOSes the volume

There's probably more, but these are the ones that affected my servers.

I also had to stop a rebalance action because the load on the above
cluster, running on 3 out of 4 servers, became too high and caused another
service unavailability outage. This might be related to 1 server being
down, as rebalance 'behaved' better before. I have not filed a ticket for
this yet.

The pattern must really be fixed, sooner rather than later, as it makes
running a production-level service with gluster impossible.

regards,
    Hans Lambermont

--
Mr. Flibble
King of the Potato People
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
