On Fri, Jul 8, 2016 at 8:02 PM, Jeff Darcy <[email protected]> wrote:
> > In either of these situations, one glusterfsd process on whatever
> > peer the client is currently talking to will skyrocket to *nproc*
> > cpu usage (800%, 1600%) and the storage cluster is essentially
> > useless; all other clients will eventually try to read or write
> > data to the overloaded peer and, when that happens, their
> > connection will hang. Heals between peers hang because the load on
> > the peer is around 1.5x the number of cores or more. This occurs
> > in either gluster 3.6 or 3.7, is very repeatable, and happens much
> > too frequently.
>
> I have some good news and some bad news.
>
> The good news is that features to address this are already planned
> for the 4.0 release. Primarily I'm referring to QoS enhancements,
> some parts of which were already implemented for the bitrot daemon.
> I'm still working out the exact requirements for this as a general
> facility, though. You can help! :) Also, some of the work on "brick
> multiplexing" (multiple bricks within one glusterfsd process) should
> help to prevent the thrashing that causes a complete freeze-up.
>
> Now for the bad news. Did I mention that these are 4.0 features?
> 4.0 is not near term, and not getting any nearer as other features
> and releases keep "jumping the queue" to absorb all of the resources
> we need for 4.0 to happen. Not that I'm bitter or anything. ;) To
> address your more immediate concerns, I think we need to consider
> more modest changes that can be completed in more modest time. For
> example:
>
> * The load should *never* get to 1.5x the number of cores. Perhaps
>   we could tweak the thread-scaling code in io-threads and epoll to
>   check system load and not scale up (or even scale down) if system
>   load is already high.
>
> * We might be able to tweak io-threads (which already runs on the
>   bricks and already has a global queue) to schedule requests in a
>   fairer way across clients. Right now it executes them in the same
>   order that they were read from the network.

This sounds like the easier fix. We can make io-threads factor in
another input, i.e., the client through which the request came in
(essentially frame->root->client), before scheduling. That should make
the problem at least bearable, rather than crippling. As for the
algorithm, we could consider the leaky-bucket implementation from
bitrot, or dmclock. I haven't thought deeply about the algorithm part
yet; if the approach sounds OK, we can discuss algorithms in more
detail. To make the discussion concrete, a few toy sketches follow at
the end of this mail.

> That tends to be a bit "unfair" and that should be fixed in the
> network code, but that's a much harder task.
>
> These are only weak approximations of what we really should be
> doing, and will be doing in the long term, but (without making any
> promises) they might be sufficient and achievable in the near term.
> Thoughts?
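To make the per-client idea concrete, here is a toy sketch in plain C
(invented names throughout, not the actual io-threads code): one queue
per client, keyed by something like frame->root->client, drained
round-robin so that one flooding client cannot starve the rest.

/* Toy model of per-client fair scheduling for io-threads.  Not the
 * actual GlusterFS code; all names here are invented.  Instead of one
 * global FIFO ordered by network arrival, each client gets its own
 * queue and the worker drains the queues round-robin. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_CLIENTS 4

struct request {
    int             id;        /* stand-in for a call frame */
    struct request *next;
};

struct client_queue {
    const char     *client_id; /* would come from frame->root->client */
    struct request *head, *tail;
};

static struct client_queue queues[MAX_CLIENTS] = {
    { "client-A" }, { "client-B" }, { "client-C" }, { "client-D" },
};

/* Enqueue on the submitting client's own queue, not a global one. */
static void enqueue(struct client_queue *q, int id)
{
    struct request *r = calloc(1, sizeof(*r));
    r->id = id;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
}

/* Round-robin: take at most one request per client per sweep. */
static struct request *dequeue_fair(const char **who)
{
    static int next = 0;       /* rotor: where the last sweep stopped */
    for (int i = 0; i < MAX_CLIENTS; i++) {
        struct client_queue *q = &queues[(next + i) % MAX_CLIENTS];
        if (q->head) {
            struct request *r = q->head;
            q->head = r->next;
            if (!q->head)
                q->tail = NULL;
            *who = q->client_id;
            next = (next + i + 1) % MAX_CLIENTS;
            return r;
        }
    }
    return NULL;               /* all queues empty */
}

int main(void)
{
    /* client-A floods six requests; B and C send one each. */
    for (int i = 0; i < 6; i++)
        enqueue(&queues[0], i);
    enqueue(&queues[1], 100);
    enqueue(&queues[2], 200);

    const char *who;
    struct request *r;
    while ((r = dequeue_fair(&who)) != NULL) {
        printf("serving %s, request %d\n", who, r->id);
        free(r);
    }
    return 0;
}

With today's global FIFO, client-A's six requests would all execute
before B's and C's; with the rotor, B and C each get served on the
first sweep.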
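For the throttling side, here is a minimal token-bucket variant of the
leaky-bucket idea (again invented names, not the bitrot daemon's
actual code): each client would own a bucket, and a request that finds
the bucket empty gets deferred rather than executed immediately.

/* Minimal token-bucket limiter, in the spirit of the throttling used
 * for bitrot.  Hypothetical sketch: one bucket per client keeps that
 * client's admitted request rate bounded. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct token_bucket {
    double tokens;      /* tokens currently available            */
    double capacity;    /* burst size                            */
    double rate;        /* tokens added per second               */
    double last;        /* timestamp of the last refill, seconds */
};

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Refill lazily based on elapsed time, then try to spend one token. */
static bool bucket_admit(struct token_bucket *b)
{
    double t = now_seconds();
    b->tokens += (t - b->last) * b->rate;
    if (b->tokens > b->capacity)
        b->tokens = b->capacity;
    b->last = t;
    if (b->tokens >= 1.0) {
        b->tokens -= 1.0;
        return true;   /* admit the request now */
    }
    return false;      /* over rate: defer/queue it instead */
}

int main(void)
{
    struct token_bucket b = { .tokens = 3, .capacity = 3,
                              .rate = 100, .last = now_seconds() };
    /* A burst of five back-to-back requests: roughly the first three
     * pass, the rest are deferred until tokens trickle back in. */
    for (int i = 0; i < 5; i++)
        printf("request %d: %s\n", i,
               bucket_admit(&b) ? "admitted" : "deferred");
    return 0;
}

As I understand it, dmclock would generalize this with per-client
reservations, weights, and limits, but the bucket is the simplest
thing that could work.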
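And for Jeff's first point, a hypothetical load check for the
thread-scaling path, with made-up thresholds: refuse to scale up, or
actively scale down, once the 1-minute load average reaches the core
count.

/* Hypothetical load check for io-threads/epoll thread scaling.  Not
 * existing GlusterFS code; thresholds are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Decide how many worker threads we are willing to run right now. */
static int scale_decision(int current_threads, int max_threads)
{
    double load[1];
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);

    if (getloadavg(load, 1) != 1)
        return current_threads;          /* can't tell; change nothing */

    if (load[0] >= 1.5 * ncores && current_threads > 1)
        return current_threads - 1;      /* overloaded: back off */
    if (load[0] >= 1.0 * ncores)
        return current_threads;          /* busy: hold steady */
    if (current_threads < max_threads)
        return current_threads + 1;      /* headroom: allow scale-up */
    return current_threads;
}

int main(void)
{
    int threads = 4;
    threads = scale_decision(threads, 16);
    printf("worker threads after load check: %d\n", threads);
    return 0;
}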
--
Raghavendra G

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
