Hi Andrew,

On Wed, Sep 07, 2016 at 02:28:47PM -0400, Andrew Rodland wrote:
> Hello all,
> 
> === Background material, skip this if you just want to know what I've done ===
> 
> For some time I've been wishing for a way to combine the best aspects of
> consistent hashing and "balance leastconn". I use haproxy to distribute
> traffic across a cluster of servers doing dynamic video delivery. Without
> going into too much detail, there's a substantial cache effect: there's a
> bunch of information that needs to be calculated for each video, and if
> that information is cached, then the request is quite a bit faster and uses
> fewer resources. Also, the server cluster runs in the cloud and auto-scales
> based on load.
> 
> Initially our approach was to cache the information locally on each server,
> and use uri-based balancing with consistent hashing to distribute the
> traffic among servers according to video ID. This gave a satisfying number
> of cache hits, but had the problem that a single server could become
> overloaded by receiving the traffic for one or two exceptionally popular
> videos, resulting in bad performance.
> 
> Our second whack was to change to "balance leastconn" (which destroys the
> cache hit ratio) and add a second layer of shared cache. This gives more
> consistent performance, but the bandwidth used for cache coherency is
> substantial, and the cache servers create a single point of failure.
> 
> What I really wanted was a policy that uses consistent hashing most of the
> time, but at the same time avoids overloading any given server, by
> diverting traffic from an overloaded server to a less-loaded server, in a
> consistent order based on the consistent hash ring. I did some
> experimenting with algorithms, but didn't find a satisfactory one.
> 
> === End of background material, here's the paper and the patch ===
That's an interesting subject. I've been used to advising people to proceed
in various ways to deal with this, one of them being to use two backends,
one running with consistent hash and another one with leastconn or round
robin, and to decide which one to use based on the number of total
connections, on the current object's concurrent connections, or on each
server's connection count. But none of these methods was perfect since they
wouldn't consider the load caused by the hash-based backend.

> Recently I came upon the paper "Consistent Hashing with Bounded Loads",
> http://arxiv.org/abs/1608.01350 . It describes a very simple algorithm for
> doing exactly what I wanted. It defines a parameter c > 1, and enforces
> that no server receives more than ceil(c * avg(conns)) connections. If a
> server is already at the limit, it goes around the hash ring until it
> finds one that is below the limit, and assigns the connection to that
> server. A large "c" provides stricter hashing (a greater chance of
> requests with the same hash key going to the same server), while a small
> "c" provides stricter balancing (less difference between the smallest and
> largest number of connections to a server). I simulated this algorithm,
> found that it would work better than the ones I tested before, and then
> implemented it in HAProxy. On testing in production I saw reduced
> 95th-percentile response times, and a great reduction in cache-coherency
> traffic.

I'm not surprised at all. The method is interesting.

> I'm attaching the patch that I tested. However, I don't expect that it is
> in any way mergeable. I'm not very familiar with the HAProxy codebase and
> I did a number of things that are almost certainly wrong. I replaced the
> existing chash_get_server_hash instead of attempting to add a new method
> or flag to the configuration.
> The "c" parameter is hard-coded instead of configurable, as haproxy
> currently has no awareness of non-integer numbers, and I wasn't sure how
> to deal with that. My code for enumerating servers and assigning
> capacities is probably suboptimal, although it didn't cause a noticeable
> bump in CPU usage on my production systems. There are probably any number
> of coding-style gaffes. And there is at least one bug that causes haproxy
> to become unresponsive when servers are marked down.

Don't worry, that's the principle of a PoC. I looked at your code. I think
we can significantly simplify it. The current server's relative capacity
should be computed on the fly when trying to pick the server in the ring.
Right now you call this function chash_update_server_capacities() to
recompute all servers' capacities before choosing your server, but given
that you already have the information on the number of total served
connections in the backend, it's possible to modify the current lookup
function to avoid running this loop first. This is especially important as
some setups have very large numbers of servers (eg: up to 1000).

Also I've been thinking about this issue of the infinite loop that you
solved already. As long as c > 1 I don't think it can happen at all,
because if any server has a load strictly greater than the average load,
there must exist at least one server with a load smaller than or equal to
the average. Otherwise it means there's no more server in the ring because
all servers are down, and then the initial lookup will simply return NULL.
Maybe there's an issue with the current lookup method, we'll have to study
this.

I used to have a concern with the time it can take to find a non-saturated
server in the list, but from what I remember, each server appears 16 times
per weight unit on the ring. So for the algorithm to take time, it would in
fact require a huge imbalance between server weights and many servers.
For example, say you have 99 servers with weight 100 and one server with
weight 1. It could in the worst case cause 1600 nodes to be studied before
finding the first of the 16 valid ones. But thinking about it, it doesn't
make sense at all for the use case this mechanism targets because most of
the time, the weights will be well balanced. So it could very well be
solved by documentation recommending that users avoid such setups.

For the next steps, I would suggest you try to proceed this way :

1) modify all places which adjust srv->served to also adjust
   srv->proxy->served (I thought it was already done but it's not). This
   way you always know the number of connections being served for a given
   backend.

2) call a function in your while() loop to check each server's eligibility
   based on the C parameter. We should consider that C==0 implies the
   mechanism is not used. This would give something approximately like
   this :

       while (c && !ch_server_is_eligible(s, c)) {
           s = next(s);
           ...
       }

   And this function ch_server_is_eligible() would check that the number
   of served connections is lower than the total number of served
   connections in the backend divided by the number of servers currently
   in use :

       s->served <= c * s->proxy->served /
           (s->proxy->srv_act ? s->proxy->srv_act :
            s->proxy->lbprm.fbck ? 1 : s->proxy->srv_bck)

3) add a configuration option. I still don't know if it would be better in
   the backend or in the server. I'm tempted to think it's better to have
   it per backend (to avoid the risk that no server is picked) and to let
   people continue to use the weights to modulate the acceptance ratio as
   they've always been doing.

This last point makes me think that we should probably use "c * s->weight"
instead of just "c" so that each server's weight is considered in this
ratio. This is quite important, otherwise in practice it will hardly be
possible to apply different weights to servers.
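To make the eligibility check above concrete, here is a minimal standalone
sketch of what ch_server_is_eligible() could look like. It uses simplified
stand-in structures, not haproxy's real struct server / struct proxy (which
carry many more fields, and where lbprm.fbck is a pointer to the first
usable backup server, reduced here to a flag). Since haproxy has no
awareness of non-integer numbers, the sketch assumes "c" is scaled by 100
(c == 120 meaning c = 1.2); that scaling is my assumption, not a decision
from this thread. The comparison is cross-multiplied to stay in integer
arithmetic:

```c
#include <assert.h>

/* Simplified stand-ins for haproxy's struct proxy / struct server; the
 * real structures have many more fields, and lbprm.fbck is a pointer,
 * reduced to an int flag here for the sake of the example. */
struct proxy {
    int served;   /* total connections currently served by the backend */
    int srv_act;  /* number of active (up) servers */
    int srv_bck;  /* number of backup servers */
    int fbck;     /* nonzero if the first backup server is used alone */
};

struct server {
    struct proxy *proxy;  /* backend this server belongs to */
    int served;           /* connections currently served by this server */
};

/* Return nonzero if server <s> may accept one more connection under the
 * bound <c>, where <c> is scaled by 100 (120 means c = 1.2).  This checks
 *     s->served <= c * s->proxy->served / nbsrv
 * with the division replaced by a cross-multiplication. */
static int ch_server_is_eligible(const struct server *s, int c)
{
    const struct proxy *px = s->proxy;
    int nbsrv = px->srv_act ? px->srv_act :
                px->fbck    ? 1 :
                px->srv_bck;

    if (!nbsrv)
        return 0;

    return (long long)s->served * nbsrv * 100 <=
           (long long)c * px->served;
}
```

For instance, with 10 active servers, 100 total served connections and
c = 1.2, a server stays eligible up to 12 concurrent connections and is
skipped beyond that. Folding in point 3 above would amount to multiplying
the right-hand side by s->weight as well.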
Please do not hesitate to indicate if you need more help, if these elements
are not clear or whatever. I definitely want to help you get this great
work merged in 1.7.

Thanks!
Willy

