On 01/25/2013 07:47 AM, jayunit...@gmail.com wrote:
Hi guys: I just saw an issue on the HDFS mailing list that might be a
potential problem in Gluster clusters. It kind of reminds me of
Jeff's idea of bricks as first-class objects in the API.
What happens if a Gluster brick is on a machine which, although still
alive, performs poorly?
Would such a scenario be detected, and if so, can the brick be
decommissioned/ignored/moved? If not, it would be a cool feature to
have, because I'm sure this happens from time to time.
There's nothing currently in place to detect such a condition, and of
course if we can't detect it we can't do anything about it. There are
also several cases where we might actually manage to make things worse
if we try to do this ourselves. For example, consider the case where
the slowness is caused by some short-lived contending activity. We
might well react just as that activity subsides, suspending that brick
just as another brick is "going bad" due to similar transient activity
there. Similarly, if the system overall is truly overloaded, suspending
bricks is a bit like squeezing a water balloon - the "bulge" just
reappears elsewhere and all we've done is diminish total resources
available.
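
To make the timing problem concrete, here's a toy simulation (plain
Python, nothing GlusterFS-specific; the threshold, smoothing factor, and
latency numbers are all invented for illustration). Whatever smoothing we
add to avoid flapping also delays the decision, so the policy ends up
suspending the brick one sample after the contention has already gone away:

THRESHOLD_MS = 40.0   # arbitrary cutoff, invented for this example
ALPHA = 0.5           # EWMA smoothing factor, also invented

# Simulated latencies (ms) for one brick: a brief spike caused by some
# contending job, then back to normal.
samples = [10, 10, 90, 95, 12, 11, 10]

ewma = samples[0]
for i, raw in enumerate(samples):
    ewma = ALPHA * raw + (1 - ALPHA) * ewma
    if ewma > THRESHOLD_MS:
        note = " (contention already gone)" if raw < THRESHOLD_MS else ""
        print("sample %d: ewma=%.1f ms, raw=%d ms -> policy suspends brick%s"
              % (i, ewma, raw, note))

The final suspension fires at a sample where the raw latency is already
back down to 12 ms - exactly the "react just as the activity subsides"
failure.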
I've seen problems like this with other parallel filesystems, and I'm
pretty sure I've read papers about them too. IMO the right place to
deal with such issues is at the job-scheduler or similar level, where
more of the total system state is known. What we can do is provide more
information about our part of the system state, plus levers that the
scheduler can pull when it decides that preparation or correction for a
higher-level event (that we probably don't even know about) is appropriate.
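
To sketch that division of labour in rough Python terms (every name
below is hypothetical - none of it is an existing GlusterFS or scheduler
interface, just an illustration of who would know what):

def scheduler_decision(brick_stats, cluster_utilization,
                       job_ending_soon_on, set_brick_weight):
    """Hypothetical scheduler-side policy.

    brick_stats is the kind of per-brick information the filesystem
    could export: {brick: {"host": ..., "latency_ms": ...,
    "baseline_ms": ...}}.  cluster_utilization and job_ending_soon_on
    are things only the scheduler knows (overall load, job start/stop
    events).  set_brick_weight stands in for a lever we could provide,
    e.g. biasing new allocations away from a brick without taking it
    offline.
    """
    if cluster_utilization > 0.9:
        # System-wide overload: suspending bricks just moves the bulge.
        return
    for brick, s in brick_stats.items():
        persistently_slow = s["latency_ms"] > 5 * s["baseline_ms"]
        if persistently_slow and not job_ending_soon_on(s["host"]):
            set_brick_weight(brick, 0.1)

The point is just that the conditionals live where the information
lives; all the filesystem contributes is the stats and the
weight-setting lever.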
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel