If you are not using the GlusterFS NFS service, have you tried Linux tc
(traffic control) or traffic isolation?
Applying traffic control to inter-node traffic will limit the
rebalance/self-heal IO. Alternatively, you can move inter-node traffic
onto its own network interface using routing and /etc/hosts entries.
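As a rough illustration of the tc approach, here is a minimal sketch
that only builds the command lines for an HTB class capping traffic sent
to the other servers. The interface name, peer addresses, and rates are
made-up placeholders, and I have not tested this against a Gluster
cluster, so treat it as a starting point rather than a recipe.

```python
def tc_commands(dev, peer_ips, link_rate="1000mbit", heal_rate="200mbit"):
    """Return tc commands that cap traffic sent to the given peer servers.

    All values are illustrative assumptions: pick rates that match your
    own link and how much bandwidth you can spare for inter-node IO.
    """
    cmds = [
        # Root HTB qdisc; unclassified traffic falls into class 1:10.
        f"tc qdisc add dev {dev} root handle 1: htb default 10",
        # Default class for everything else (client traffic).
        f"tc class add dev {dev} parent 1: classid 1:10 htb rate {link_rate}",
        # Capped class for traffic going to the other Gluster servers.
        f"tc class add dev {dev} parent 1: classid 1:20 htb "
        f"rate {heal_rate} ceil {heal_rate}",
    ]
    for ip in peer_ips:
        # Steer packets destined for each peer into the capped class.
        cmds.append(
            f"tc filter add dev {dev} protocol ip parent 1: prio 1 "
            f"u32 match ip dst {ip}/32 flowid 1:20"
        )
    return cmds


if __name__ == "__main__":
    # Hypothetical interface and peer addresses (documentation range).
    for cmd in tc_commands("eth0", ["192.0.2.11", "192.0.2.12"]):
        print(cmd)
```

Note that this shapes egress only, so each server would need the same
rules for the traffic it sends to its peers.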
I would expect that one of the issues in controlling the
rebalance/self-heal IO at the glusterfsd level is hooking into the
kernel for traffic information on the interface it is routing through.
Since both activities are push based, the receiver needs to know its
full IO picture via the kernel and push back accordingly.
You don't want a static setting: it would throttle rebalance/self-heal
too much at idle times and not back off enough under high load. So the
limit needs to be dynamic, based on "available" IO.
This is definitely a "not easy" problem to solve.
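To make the "dynamic" idea concrete, here is a toy sketch of the
calculation I have in mind. It is not something glusterfsd does today;
the function name, the reserve fraction, and the floor are assumptions
of mine, and getting a trustworthy measurement of the foreground load is
the hard part that this leaves out.

```python
def background_rate(capacity, foreground, floor=10.0, reserve=0.2):
    """Bandwidth to grant rebalance/self-heal at this moment.

    capacity   -- total link or disk bandwidth
    foreground -- bandwidth currently used by client IO (measured elsewhere)
    floor      -- minimum grant, so healing never stalls completely
    reserve    -- fraction of capacity kept free for client bursts
    All arguments share one unit (for example Mbit/s).
    """
    # Keep part of the capacity free so client traffic can ramp up.
    usable = capacity * (1.0 - reserve)
    # Whatever the clients are not using is available for background work.
    available = usable - foreground
    # Clamp between the floor and the usable capacity.
    return max(floor, min(available, usable))


if __name__ == "__main__":
    # Idle, busy, and saturated examples on a hypothetical 1000 Mbit/s link.
    for load in (0, 700, 900):
        print(load, background_rate(1000, load))
```

The receiver would have to re-evaluate this every few seconds and push
the result back to the sender for it to be of any use.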
On 05/16/13 02:54, Hans Lambermont wrote:
Hi all,
My production setup also suffers from total unavailability outages when
self-heal gets real work to do. On a 4-server distributed-replicate 14x2
cluster where 1 server has been down for 2 days, the volume becomes
completely unresponsive when we bring the server back into the cluster.
I ticketed it here : https://bugzilla.redhat.com/show_bug.cgi?id=963223
"Re-inserting a server in a v3.3.2qa2 distributed-replicate volume DOSes
the volume"
Does anyone know of a way to slow down self-heal so that it does not
make the volume unresponsive?
The "unavailability due to high load caused by gluster itself" pattern
repeats itself in several cases :
https://bugzilla.redhat.com/show_bug.cgi?id=950024
    replace-brick immediately saturates IO on source brick causing the
    entire volume to be unavailable, then dies
https://bugzilla.redhat.com/show_bug.cgi?id=950006
    replace-brick activity dies, destination glusterfs spins at 100% CPU
    forever
https://bugzilla.redhat.com/show_bug.cgi?id=832609
    Glusterfsd hangs if brick filesystem becomes unresponsive, causing
    all clients to lock up
https://bugzilla.redhat.com/show_bug.cgi?id=962875
    Entire volume DOSes itself when a node reboots and runs fsck on its
    bricks while network is up
https://bugzilla.redhat.com/show_bug.cgi?id=963223
    Re-inserting a server in a v3.3.2qa2 distributed-replicate volume
    DOSes the volume
There's probably more, but these are the ones that affected my servers.
I also had to stop a rebalance because the load on the above cluster
(3 out of 4 servers up) got too high and caused another service
unavailability outage. This might be related to 1 server being down, as
rebalance 'behaved' better before. I have not filed a ticket for this yet.
The pattern really must be fixed, sooner rather than later, as it makes
running a production-level service with gluster impossible.
regards,
Hans Lambermont
--
Mr. Flibble
King of the Potato People