Hi Andrew,

On Wed, Sep 07, 2016 at 02:28:47PM -0400, Andrew Rodland wrote:
> Hello all,
> 
> === Background material, skip this if you just want to know what I've done ===
> 
> For some time I've been wishing for a way to combine the best aspects of 
> consistent hashing and "balance leastconn". I use haproxy to distribute 
> traffic across a cluster of servers doing dynamic video delivery. Without 
> going into too much detail, there's a substantial cache effect: there's a 
> bunch of information that needs to be calculated for each video, and if 
> that information is cached, then the request is quite a bit faster and 
> uses fewer resources. Also, the server cluster runs in the cloud and 
> auto-scales based on load.
> 
> Initially our approach was to cache the information locally on each server, 
> and use uri-based balancing with consistent hashing to distribute the traffic 
> among servers according to video ID. This gave a satisfying number of cache 
> hits, but had the problem that a single server could become overloaded by 
> receiving the traffic for one or two exceptionally popular videos, 
> resulting in bad performance.
> 
> Our second whack was to change to "balance leastconn" (which destroys the 
> cache hit ratio) and add a second layer of shared cache. This gives more 
> consistent performance, but the bandwidth used for cache coherency is 
> substantial, and the cache servers create a single point of failure.
> 
> What I really wanted was a policy that uses consistent hashing most of the 
> time, but at the same time, avoids overloading any given server, by diverting 
> traffic from an overloaded server to a less-loaded server, in a consistent 
> order based on the consistent hash ring. I did some experimenting with 
> algorithms, but didn't find a satisfactory one.
> 
> === End of background material, here's the paper and the patch ===

That's an interesting subject. I've long been advising people to proceed in
various ways to deal with this, one of them being to use two backends, one
running with consistent hash and the other with leastconn or round robin,
and to decide which one to use based on the total number of connections, on
the current object's concurrent connections, or on each server's connection
count. But none of these methods was perfect, since they wouldn't consider
the load caused by the hash-based backend.

> Recently I came upon the paper "Consistent Hashing with Bounded Loads", 
> http://arxiv.org/abs/1608.01350 . It describes a very simple algorithm for 
> doing exactly what I wanted. It defines a parameter c > 1, and enforces 
> that no server receives more than ceil(c * avg(conns)) connections. If a 
> server is already at the limit, it goes around the hash ring until it 
> finds one that is below the limit, and assigns the connection to that 
> server. A large "c" provides stricter hashing (a greater chance of 
> requests with the same hash key going to the same server), while a small 
> "c" provides stricter balancing (less difference between the smallest and 
> largest number of connections to a server). I simulated this algorithm, 
> found that it would work better than the ones I tested before, and then 
> implemented it in HAProxy. On testing in production I saw reduced 
> 95th-percentile response times, and a great reduction in cache-coherency 
> traffic.

I'm not surprised at all. The method is interesting.
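The pick described above can be sketched in a few lines (a toy standalone
model rather than HAProxy code: the ring is reduced to whole servers, and
the server count and connection values are purely illustrative):

```c
#define NSRV 4

/* current connections per server; the skewed values make server 0 "hot" */
static int served[NSRV] = { 5, 1, 1, 1 };

/* ceil(c * avg(conns)) without libm; floor of at least 1 so an idle
 * ring still accepts connections */
static int load_limit(double c)
{
    int total = 0, i, lim;
    double x;

    for (i = 0; i < NSRV; i++)
        total += served[i];
    x = c * (double)total / NSRV;
    lim = (int)x;
    if ((double)lim < x)
        lim++;
    return lim < 1 ? 1 : lim;
}

/* Walk the ring from the hashed slot and return the first server whose
 * load would stay within the bound after taking one more connection. */
static int pick_server(int slot, double c)
{
    int limit = load_limit(c);
    int i, s;

    for (i = 0; i < NSRV; i++) {
        s = (slot + i) % NSRV;
        if (served[s] < limit)
            return s;
    }
    return -1; /* all saturated: cannot happen while c > 1 */
}
```

With c = 1.25 and 8 total connections, the limit is ceil(2.5) = 3, so the
hot server 0 is skipped and traffic hashed to slot 0 spills over to
server 1, the next one on the ring.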

> I'm attaching the patch that I tested. However, I don't expect that it is 
> in any way mergeable. I'm not very familiar with the HAProxy codebase and 
> I did a number of things that are almost certainly wrong. I replaced the 
> existing chash_get_server_hash instead of attempting to add a new method 
> or flag to the configuration. The "c" parameter is hard-coded instead of 
> configurable, as haproxy currently has no awareness of non-integer 
> numbers, and I wasn't sure how to deal with that. My code for enumerating 
> servers and assigning capacities is probably suboptimal, although it 
> didn't cause a noticeable bump in CPU usage on my production systems. 
> There are probably any number of coding-style gaffes. And there is at 
> least one bug that causes haproxy to become unresponsive when servers are 
> marked down.

Don't worry, that's the principle of a PoC. I looked at your code. I think
we can simplify it significantly. The current server's relative capacity
should be computed on the fly when trying to pick the server in the ring.
Right now you call chash_update_server_capacities() to recompute all
servers' capacities before choosing your server, but given that you already
have the information on the total number of served connections in the
backend, it's possible to modify the current lookup function to avoid
running this loop first. This is especially important as some setups have
very large numbers of servers (eg: up to 1000).

Also, I've been thinking about the infinite loop issue that you already
solved. As long as c > 1, I don't think it can happen at all: if any server
has a load strictly greater than the average, then there must exist at
least one server with a load smaller than or equal to the average.
Otherwise it means there is no server left in the ring because all servers
are down, and then the initial lookup will simply return NULL. Maybe
there's an issue with the current lookup method; we'll have to study this.

I used to have a concern about the time it can take to find a non-saturated
server in the list, but from what I remember, each server appears 16 times
per weight unit on the ring. So for the algorithm to take time, it would in
fact require a huge imbalance between server weights and many servers. For
example, say you have 99 servers with weight 100 and one server with weight
1. In the worst case, 1600 nodes could be studied before finding the first
of the 16 valid ones. But thinking about it, that doesn't make sense at all
for the use case this mechanism targets, because most of the time the
weights will be well balanced. So it could very well be solved by
documentation recommending that users avoid such setups.

For the next steps, I would suggest you proceed this way:

1) modify all places which adjust srv->served to also adjust
   srv->proxy->served (I thought this was already done, but it is not).
   This way you always know the number of connections being served by a
   given backend.
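That bookkeeping amounts to something like this (toy stand-ins for the
HAProxy structures; the field names follow the mail, everything else is
illustrative):

```c
/* Simplified structures: only the fields involved in the counting. */
struct proxy { int served; };
struct server { int served; struct proxy *proxy; };

/* Every spot that adjusts srv->served also adjusts srv->proxy->served,
 * so the backend-wide total is always known. */
static void srv_take_conn(struct server *s)
{
    s->served++;
    s->proxy->served++;
}

static void srv_drop_conn(struct server *s)
{
    s->served--;
    s->proxy->served--;
}

/* Self-check: two connections taken on s1, one on s2, one released. */
static int demo_backend_served(void)
{
    struct proxy px = { 0 };
    struct server s1 = { 0, &px };
    struct server s2 = { 0, &px };

    srv_take_conn(&s1);
    srv_take_conn(&s1);
    srv_take_conn(&s2);
    srv_drop_conn(&s1);
    return px.served; /* 1 + 1 + 1 - 1 = 2 */
}
```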

2) call a function in your while() loop to check each server's eligibility
   based on the c parameter. We should consider that c == 0 implies the
   mechanism is not used. This would give something approximately like
   this:

   while (c && !ch_server_is_eligible(s, c)) {
      s = next(s);
      ...
   }

   And this function ch_server_is_eligible() would check that the server's
   number of served connections is no greater than c times the total number
   of served connections in the backend divided by the number of servers
   currently in use:

      s->served <= c * s->proxy->served /
         (s->proxy->srv_act ? s->proxy->srv_act :
          s->proxy->lbprm.fbck ? 1 : s->proxy->srv_bck)
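As a sketch of how that check could look (simplified toy structures, and c
stored in thousandths to keep integer arithmetic since haproxy has no
floating-point configuration values; both are assumptions, not the final
design):

```c
/* Toy stand-ins for the HAProxy fields used in the formula above. */
struct lbprm { int fbck; };            /* non-zero: fallback to one backup */
struct proxy {
    int served;                        /* total served conns on the backend */
    int srv_act;                       /* usable active servers */
    int srv_bck;                       /* usable backup servers */
    struct lbprm lbprm;
};
struct server { int served; struct proxy *proxy; };

/* c_milli == 0 disables the mechanism; otherwise the server is eligible
 * while s->served <= c * proxy->served / nsrv, evaluated in integers by
 * cross-multiplying instead of dividing. */
static int ch_server_is_eligible(const struct server *s, unsigned int c_milli)
{
    const struct proxy *px = s->proxy;
    int nsrv;

    if (!c_milli)
        return 1;

    nsrv = px->srv_act ? px->srv_act :
           px->lbprm.fbck ? 1 : px->srv_bck;
    if (!nsrv)
        return 0;

    return (unsigned long long)s->served * nsrv * 1000ULL <=
           (unsigned long long)c_milli * px->served;
}

/* Helper for exercising the check with ad-hoc numbers. */
static int demo_eligible(int srv_served, int px_served, int srv_act,
                         unsigned int c_milli)
{
    struct proxy px = { px_served, srv_act, 0, { 0 } };
    struct server s = { srv_served, &px };

    return ch_server_is_eligible(&s, c_milli);
}
```

For instance, with 4 active servers, 8 total connections and c = 1.25
(c_milli = 1250), a server holding 2 connections is eligible while one
holding 3 is not.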

3) add a configuration option. I still don't know if it would be better
   in the backend or on the server. I'm tempted to think it's better to
   have it per backend (to avoid the risk that no server is picked) and
   to let people continue to use the weights to modulate the acceptance
   ratio as they've always been doing.

This last point makes me think that we should probably use "c * s->weight"
instead of just "c" so that each server's weight is considered in this
ratio. This is quite important, otherwise in practice it will hardly be
possible to apply different weights to servers.
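One plausible reading of that weighting (an assumption, not a settled
design) is that a server's fair share of the backend's load is its weight
divided by the backend's total weight, scaled by c. Plain ints keep the
sketch self-contained; c is again in thousandths:

```c
/* Weighted eligibility: srv_served <= c * (srv_weight / tot_weight)
 * * px_served, cross-multiplied to stay in integer arithmetic.
 * All parameter names here are illustrative. */
static int ch_server_is_eligible_w(int srv_served, int srv_weight,
                                   int px_served, int tot_weight,
                                   unsigned int c_milli)
{
    if (!c_milli)
        return 1;  /* c == 0: mechanism disabled, always eligible */
    if (!tot_weight)
        return 0;  /* no usable weight on the backend */

    return (unsigned long long)srv_served * tot_weight * 1000ULL <=
           (unsigned long long)c_milli * srv_weight * px_served;
}
```

With weights 3 and 1 (total 4), 8 backend connections and c = 1.25, the
weight-3 server's bound is 7.5 connections while the weight-1 server's is
2.5, so unequal weights translate directly into unequal caps.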

Please do not hesitate to ask if you need more help, if these elements are
not clear, or whatever. I definitely want to help you get this great work
merged into 1.7.

Thanks!
Willy
