Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-19 Thread Willy Tarreau
Hi Andrew,

On Mon, Sep 19, 2016 at 11:32:49AM -0400, Andrew Rodland wrote:
(...)
> I haven't found the cause of this, or been able to pin it down much further 
> than that it happens fairly reliably when doing a "haproxy -sf" restart under 
> load.

OK I'll have to test it.

> Other than that, I think I have things working properly and would 
> appreciate a bit of review. My changes are on the "bounded-chash" branch of 
> github.com/arodland/haproxy — or would you prefer a patch series sent to
> the list?

It's better and more convenient to send the patch series to the list. It's
easy to respond inline, and it opens the review to everyone, as everyone may
have an opinion on the code or even good suggestions.

Thanks,
Willy



Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-19 Thread Andrew Rodland
On Thursday, September 15, 2016 4:06:15 AM EDT Willy Tarreau wrote:
> Hi Andrew,
> 
> On Wed, Sep 14, 2016 at 02:44:26PM -0400, Andrew Rodland wrote:
> > On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> > > > > Also I've been thinking about this issue of the infinite loop that
> > > > > you solved already. As long as c > 1 I don't think it can happen at
> > > > > all, because for any server having a load strictly greater than the
> > > > > average load, it means there exists at least one server with a load
> > > > > smaller than or equal to the average. Otherwise it means there's no
> > > > > more server in the ring because all servers are down, and then the
> > > > > initial lookup will simply return NULL. Maybe there's an issue with
> > > > > the current lookup method, we'll have to study this.
> > > > 
> > > > Agreed again, it should be impossible as long as c > 1, but I ran into
> > > > it. I assumed it was some problem or misunderstanding in my code.
> > > 
> > > Don't worry I trust you, I was trying to figure what exact case could
> > > cause this and couldn't find a single possible case :-/
> > 
> > I've encountered this again in my re-written branch. I think it has to do
> > with the case where all servers are draining for shutdown. What I see is
> > that whenever I do a restart (haproxy -sf oldpid) under load, the new
> > process starts up, but the old process never exits, and perf shows it
> > using 100% CPU in chash_server_is_eligible, so it's got to be looping and
> > deciding nothing is eligible. Can you think of anything special that
> > needs to be done to handle graceful shutdown?
> 
> No, that's very strange. We may have a bug somewhere else which never
> struck till now. When you talk about a shutdown, you in fact mean the
> shutdown of the haproxy process being replaced by another one, that's
> right? If so, health checks are disabled during that period so servers
> should not be added to nor removed from the ring.
> 
> However if for any reason there's a graceful shutdown on the servers,
> their weight can be set to zero while they're still active. In this
> case they don't appear in the tree and that may be where the issue
> starts. It would be nice to get a 100% reproducible case to try to
> debug it and dump all weights and capacities, I think it would help.
> 
> Willy

I haven't found the cause of this, or been able to pin it down much further 
than that it happens fairly reliably when doing a "haproxy -sf" restart under 
load. Other than that, I think I have things working properly and would 
appreciate a bit of review. My changes are on the "bounded-chash" branch of 
github.com/arodland/haproxy — or would you prefer a patch series sent to the 
list?

Thanks,

Andrew




Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-14 Thread Willy Tarreau
Hi Andrew,

On Wed, Sep 14, 2016 at 02:44:26PM -0400, Andrew Rodland wrote:
> On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> > > > Also I've been thinking about this issue of the infinite loop that you
> > > > solved already. As long as c > 1 I don't think it can happen at all,
> > > > because for any server having a load strictly greater than the average
> > > > load, it means there exists at least one server with a load smaller than
> > > > or equal to the average. Otherwise it means there's no more server in
> > > > the ring because all servers are down, and then the initial lookup will
> > > > simply return NULL. Maybe there's an issue with the current lookup
> > > > method, we'll have to study this.
> > > 
> > > Agreed again, it should be impossible as long as c > 1, but I ran into it.
> > > I assumed it was some problem or misunderstanding in my code.
> > 
> > Don't worry I trust you, I was trying to figure what exact case could
> > cause this and couldn't find a single possible case :-/
> 
> I've encountered this again in my re-written branch. I think it has to do
> with the case where all servers are draining for shutdown. What I see is
> that whenever I do a restart (haproxy -sf oldpid) under load, the new
> process starts up, but the old process never exits, and perf shows it
> using 100% CPU in chash_server_is_eligible, so it's got to be looping and
> deciding nothing is eligible. Can you think of anything special that needs
> to be done to handle graceful shutdown?

No, that's very strange. We may have a bug somewhere else which never
struck till now. When you talk about a shutdown, you in fact mean the
shutdown of the haproxy process being replaced by another one, that's
right? If so, health checks are disabled during that period so servers
should not be added to nor removed from the ring.

However if for any reason there's a graceful shutdown on the servers,
their weight can be set to zero while they're still active. In this
case they don't appear in the tree and that may be where the issue
starts. It would be nice to get a 100% reproducible case to try to
debug it and dump all weights and capacities, I think it would help.

Willy



Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-14 Thread Andrew Rodland
On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> > > Also I've been thinking about this issue of the infinite loop that you
> > > solved already. As long as c > 1 I don't think it can happen at all,
> > > because for any server having a load strictly greater than the average
> > > load, it means there exists at least one server with a load smaller than
> > > or equal to the average. Otherwise it means there's no more server in
> > > the ring because all servers are down, and then the initial lookup will
> > > simply return NULL. Maybe there's an issue with the current lookup
> > > method, we'll have to study this.
> > 
> > Agreed again, it should be impossible as long as c > 1, but I ran into it.
> > I assumed it was some problem or misunderstanding in my code.
> 
> Don't worry I trust you, I was trying to figure what exact case could
> cause this and couldn't find a single possible case :-/

I've encountered this again in my re-written branch. I think it has to do with 
the case where all servers are draining for shutdown. What I see is that 
whenever I do a restart (haproxy -sf oldpid) under load, the new process 
starts up, but the old process never exits, and perf shows it using 100% CPU 
in chash_server_is_eligible, so it's got to be looping and deciding nothing is 
eligible. Can you think of anything special that needs to be done to handle 
graceful shutdown?

Thanks,

Andrew



Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-12 Thread Andrew Rodland
On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> Hi Andrew,
> 
> On Sun, Sep 11, 2016 at 01:45:35PM -0400, Andrew Rodland wrote:
> > On Sunday, September 11, 2016 9:11:50 AM EDT Willy Tarreau wrote:
> > > Don't worry, that's the principle of a PoC. I looked at your code. I
> > > think we can significantly simplify it. The current server's relative
> > > capacity should be computed on the fly when trying to pick the server
> > > in the ring. Right now you call this function
> > > chash_update_server_capacities() to recompute all servers' capacities
> > > before choosing your server but given that you already have the
> > > information on the number of total served connections in the backend,
> > > it's possible to modify the current lookup function to avoid running
> > > this loop first. This is especially important as some setups have very
> > > large numbers of servers (eg: up to 1000).
> > 
> > I agree that this would be better, but the algorithm is supposed to
> > distribute the capacity among servers in a particular way. Concretely,
> > say that the computed capacity is 14 and there are 10 servers. Then there
> > should be 4 servers with a capacity of 2 and 6 servers with a capacity of
> > 1, and which are which shouldn't depend on where the request hashed to
> > — there needs to be some way of identifying the "first" 4 servers.
> > Walking the hash ring seems like an inefficient way to do it, I agree,
> > but I didn't have a better one.
> Wait a minute, I think you're talking about the while() loop used to
> find the next non-saturated server. I agree this one cannot be removed
> otherwise we create a domino effect by overloading a specific server
> with all requests initially made for another server. Here I'm talking
> about the initial function which is used to compute all relative
> capacities (sorry I don't have its name in mind and don't have access
> to the sources right now). Since the servers' capacities do not change
> during the loop, it's better to update the backend's average load as
> servers take or release connections and be able to compute a server's
> weight on the fly instead of pre-computing it.

No, I mean the first loop. If the average load times the c factor is
non-integer, the algorithm gives some servers an effective capacity of the
next integer up, and some the next integer down. This is easily done, except
that you have to have a way of knowing whether the server you're looking at
is in the "first" n or not, according to some stable ordering of the servers.

Everything else, I think I have figured out, including the math for the
weights, and I will get to work as I have time.

Andrew
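
To make the split concrete, here is a minimal standalone C sketch of the
arithmetic described above (the function and parameter names are
illustrative, not taken from the patch): a total capacity that does not
divide evenly is spread so that the first total % nsrv servers, in some
stable ordering, get one extra slot.

    #include <assert.h>

    /* Illustrative sketch, not patch code: spread 'total_cap' capacity
     * slots over 'nsrv' servers so that the first (total_cap % nsrv)
     * servers in a stable ordering get one extra slot. With
     * total_cap = 14 and nsrv = 10 this yields 4 servers with capacity 2
     * and 6 servers with capacity 1.
     */
    static unsigned srv_capacity(unsigned total_cap, unsigned nsrv,
                                 unsigned stable_idx)
    {
        unsigned base  = total_cap / nsrv;   /* 14 / 10 -> 1 */
        unsigned extra = total_cap % nsrv;   /* 14 % 10 -> 4 */

        return base + (stable_idx < extra ? 1 : 0);
    }

    int main(void)
    {
        unsigned i, sum = 0;

        for (i = 0; i < 10; i++)
            sum += srv_capacity(14, 10, i);
        assert(sum == 14);                   /* the capacities add up */
        return 0;
    }

The open question in the thread is only what the stable ordering should be
(ring order, server id, ...), not the arithmetic itself.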



Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-11 Thread Willy Tarreau
Hi Andrew,

On Sun, Sep 11, 2016 at 01:45:35PM -0400, Andrew Rodland wrote:
> On Sunday, September 11, 2016 9:11:50 AM EDT Willy Tarreau wrote:
> > Don't worry, that's the principle of a PoC. I looked at your code. I think
> > we can significantly simplify it. The current server's relative capacity
> > should be computed on the fly when trying to pick the server in the ring.
> > Right now you call this function chash_update_server_capacities() to
> > recompute all servers' capacities before choosing your server but given that
> > you already have the information on the number of total served connections
> > in the backend, it's possible to modify the current lookup function to
> > avoid running this loop first. This is especially important as some setups
> > have very large numbers of servers (eg: up to 1000).
> 
> I agree that this would be better, but the algorithm is supposed to
> distribute the capacity among servers in a particular way. Concretely, say
> that the computed capacity is 14 and there are 10 servers. Then there
> should be 4 servers with a capacity of 2 and 6 servers with a capacity of
> 1, and which are which shouldn't depend on where the request hashed to —
> there needs to be some way of identifying the "first" 4 servers. Walking
> the hash ring seems like an inefficient way to do it, I agree, but I
> didn't have a better one.

Wait a minute, I think you're talking about the while() loop used to
find the next non-saturated server. I agree this one cannot be removed
otherwise we create a domino effect by overloading a specific server
with all requests initially made for another server. Here I'm talking
about the initial function which is used to compute all relative
capacities (sorry I don't have its name in mind and don't have access
to the sources right now). Since the servers' capacities do not change
during the loop, it's better to update the backend's average load as
servers take or release connections and be able to compute a server's
weight on the fly instead of pre-computing it.

> > Also I've been thinking about this issue of the infinite loop that you
> > solved already. As long as c > 1 I don't think it can happen at all,
> > because for any server having a load strictly greater than the average
> > load, it means there exists at least one server with a load smaller than
> > or equal to the average. Otherwise it means there's no more server in
> > the ring because all servers are down, and then the initial lookup will
> > simply return NULL. Maybe there's an issue with the current lookup
> > method, we'll have to study this.
> 
> Agreed again, it should be impossible as long as c > 1, but I ran into it. I 
> assumed it was some problem or misunderstanding in my code.

Don't worry I trust you, I was trying to figure what exact case could
cause this and couldn't find a single possible case :-/

> > 3) add a configuration option. I still don't know if it would be better
> >    in the backend or in the server. I'm tempted to think it's better to
> >    have it per backend (to avoid the risk that no server is picked) and
> >    to let people continue to use the weights to modulate the acceptance
> >    ratio as they've always been doing.
> > 
> > This last point makes me think that we should probably use "c * s->weight"
> > instead of just "c" so that each server's weight is considered in this
> > ratio. This is quite important, otherwise in practice it will not really
> > be possible to apply different weights to servers.
> 
> I will have to think on how to do the weights properly, and I will
> probably do my first stab without them, but I do follow your logic.

We already use weights everywhere, namely with leastconn. While I have
no problem with reviewing a code update without them, I'd definitely
want to have them for the merge, otherwise we'd create two different
behaviours which would be confusing or even considered as bogus. Let's
reuse the same logic we have at other places (ie: number of conns per
server is divided by their relative weight so that the comparison is
performed on normalized weights).
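
As a standalone sketch of that normalization (illustrative names, not
HAProxy's actual fields): a server is eligible when its load divided by its
weight is at or below c times the backend's load divided by the total
weight, which can be written without any division:

    /* Illustrative sketch: eligible if
     *   srv_served / srv_weight <= c * px_served / tot_weight
     * i.e. the server's load, normalized by its weight, is at or below
     * c times the weighted backend average. Cross-multiplied so that no
     * division by the weights is needed.
     */
    static inline int eligible_weighted(unsigned srv_served, unsigned srv_weight,
                                        unsigned px_served, unsigned tot_weight,
                                        double c)
    {
        if (!srv_weight || !tot_weight)
            return 0;  /* a zero-weight (draining) server takes nothing */

        return (double)srv_served * tot_weight <= c * px_served * srv_weight;
    }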

> As for the configuration parameter, it's okay to add a double and use atof? 
> Normally I wouldn't ask, but since haproxy has no floats at all I'm double-
> checking before introducing them.

That's an excellent question, I didn't think about it. What would you
think about using integer percentages instead ? Many people produce or
update their configs using shell scripts and floats would cause some
trouble there. Using a percentage would work exactly like for weights
that cause no issue at the moment. And I don't think you'll need a
resolution finer than 1% to adjust the relative capacities. Here all
values would be 100 or larger.

Cheers,
Willy
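
A sketch of how an integer percentage keeps the check float-free (the names
and the exact semantics of the factor are assumptions, as the keyword had
not been settled at this point): with the factor expressed in percent,
e.g. 150 for c = 1.5 and 0 meaning "disabled", the comparison becomes pure
integer arithmetic.

    /* Illustrative sketch: the bound with the factor given as an integer
     * percentage (150 means c = 1.5, 0 disables the mechanism), so no
     * floating point is needed anywhere.
     */
    static inline int eligible_pct(unsigned srv_served, unsigned px_served,
                                   unsigned nsrv, unsigned factor_pct)
    {
        if (!factor_pct)
            return 1;                   /* 0 = mechanism disabled */
        if (!nsrv)
            return 0;                   /* no usable server */

        /* srv_served <= (factor_pct / 100) * px_served / nsrv,
         * rewritten as a cross-multiplication:
         */
        return (unsigned long long)srv_served * nsrv * 100
                   <= (unsigned long long)factor_pct * px_served;
    }

Values of 100 or more correspond to c >= 1; the paper's guarantees assume
c strictly greater than 1.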



Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-11 Thread Andrew Rodland
On Sunday, September 11, 2016 9:11:50 AM EDT Willy Tarreau wrote:
> Don't worry, that's the principle of a PoC. I looked at your code. I think
> we can significantly simplify it. The current server's relative capacity
> should be computed on the fly when trying to pick the server in the ring.
> Right now you call this function chash_update_server_capacities() to
> recompute all servers' capacities before choosing your server but given that
> you already have the information on the number of total served connections
> in the backend, it's possible to modify the current lookup function to
> avoid running this loop first. This is especially important as some setups
> have very large numbers of servers (eg: up to 1000).

I agree that this would be better, but the algorithm is supposed to distribute 
the capacity among servers in a particular way. Concretely, say that the 
computed capacity is 14 and there are 10 servers. Then there should be 4 
servers with a capacity of 2 and 6 servers with a capacity of 1, and which are 
which shouldn't depend on where the request hashed to — there needs to be some 
way of identifying the "first" 4 servers. Walking the hash ring seems like an 
inefficient way to do it, I agree, but I didn't have a better one.

> Also I've been thinking about this issue of the infinite loop that you
> solved already. As long as c > 1 I don't think it can happen at all,
> because for any server having a load strictly greater than the average
> load, it means there exists at least one server with a load smaller than
> or equal to the average. Otherwise it means there's no more server in
> the ring because all servers are down, and then the initial lookup will
> simply return NULL. Maybe there's an issue with the current lookup
> method, we'll have to study this.

Agreed again, it should be impossible as long as c > 1, but I ran into it. I 
assumed it was some problem or misunderstanding in my code.

> For the next steps, I would suggest you try to proceed this way :
> 
> 1) modify all places which adjust srv->served to also adjust
>    srv->proxy->served (I thought it was already done but not). This
>    way you always know the amount of connections being served for a
>    given backend.

Easy enough.

> 2) call a function in your while() loop to check each server's eligibility
>    based on the C parameter. We should consider that C==0 implies the
>    mechanism is not used. This would give something approximately like
>    this :
> 
>        while (c && !ch_server_is_eligible(s, c)) {
>            s = next(s);
>            ...
>        }
> 
>    And this function ch_server_is_eligible() would check that the
>    number of served connections is lower than the total number of
>    served connections in the backend divided by the number of servers
>    currently in use :
> 
>        s->served <= c * s->proxy->served /
>                     (s->proxy->srv_act ? s->proxy->srv_act :
>                      s->proxy->lbprm.fbck ? 1 : s->proxy->srv_bck)

Agreed, except for the caveat I mentioned above.
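
For reference, a standalone rendering of the check sketched in point 2
above, with the struct fields flattened into plain parameters so it compiles
on its own (the real patch would read them from s and s->proxy, and the
exact rounding behaviour is left open here):

    /* Standalone rendering of the suggested ch_server_is_eligible():
     *   srv_served        -> s->served
     *   px_served         -> s->proxy->served (backend-wide counter)
     *   srv_act / srv_bck -> s->proxy->srv_act / s->proxy->srv_bck
     *   has_fbck          -> s->proxy->lbprm.fbck != NULL
     */
    static inline int ch_server_is_eligible(unsigned srv_served,
                                            unsigned px_served,
                                            int srv_act, int srv_bck,
                                            int has_fbck, double c)
    {
        int nsrv = srv_act ? srv_act : (has_fbck ? 1 : srv_bck);

        if (nsrv <= 0)
            return 0;   /* no usable server in the backend */

        /* eligible while at or below c times the average load */
        return srv_served <= c * px_served / nsrv;
    }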

> 3) add a configuration option. I still don't know if it would be better
>    in the backend or in the server. I'm tempted to think it's better to
>    have it per backend (to avoid the risk that no server is picked) and
>    to let people continue to use the weights to modulate the acceptance
>    ratio as they've always been doing.
> 
> This last point makes me think that we should probably use "c * s->weight"
> instead of just "c" so that each server's weight is considered in this
> ratio. This is quite important, otherwise in practice it will not really
> be possible to apply different weights to servers.

I will have to think on how to do the weights properly, and I will probably do 
my first stab without them, but I do follow your logic.

As for the configuration parameter, it's okay to add a double and use atof? 
Normally I wouldn't ask, but since haproxy has no floats at all I'm double-
checking before introducing them.

Thanks,

Andrew




Re: Experimental patch: Consistent Hashing with Bounded Loads

2016-09-11 Thread Willy Tarreau
Hi Andrew,

On Wed, Sep 07, 2016 at 02:28:47PM -0400, Andrew Rodland wrote:
> Hello all,
> 
> === Background material, skip this if you just want to know what I've done ===
> 
> For some time I've been wishing for a way to combine the best aspects of 
> consistent hashing and "balance leastconn". I use haproxy to distribute 
> traffic 
> across a cluster of servers doing dynamic video delivery. Without going into 
> too much detail, there's a substantial cache effect — there's a bunch of
> information that needs to be calculated for each video, and if that 
> information is cached, then the request is quite a bit faster and uses less 
> resources. Also, the server cluster runs in the cloud and auto-scales
> based on load.
> 
> Initially our approach was to cache the information locally on each server, 
> and use uri-based balancing with consistent hashing to distribute the traffic 
> among servers according to video ID. This gave a satisfying number of cache 
> hits, but had the problem that a single server could become overloaded by 
> receiving the traffic for one or two exceptionally popular videos,
> resulting in bad performance.
> 
> Our second whack was to change to "balance leastconn" (which destroys the 
> cache hit ratio) and add a second layer of shared cache. This gives more 
> consistent performance, but the bandwidth used for cache coherency is 
> substantial, and the cache servers create a single point of failure.
> 
> What I really wanted was a policy that uses consistent hashing most of the 
> time, but at the same time, avoids overloading any given server, by diverting 
> traffic from an overloaded server to a less-loaded server, in a consistent
> order based on the consistent hash ring. I did some experimenting with
> algorithms, but didn't find a satisfactory one.
> 
> === End of background material, here's the paper and the patch ===

That's an interesting subject. I've been used to advising people to proceed
in various ways to deal with this, one of them being to use two backends, one
running with consistent hash and another one with leastconn or round robin,
and to decide which one to use based on the number of total connections, on
the current object's concurrent connections or on each server's connection
count. But none of these methods was perfect since they wouldn't consider
the load caused by the hash-based backend.

> Recently I came upon the paper "Consistent Hashing with Bounded Loads",
> http://arxiv.org/abs/1608.01350 . It describes a very simple algorithm for
> doing exactly what I wanted. It defines a parameter c > 1, and enforces
> that no server receives more than ceil(c * avg(conns)) connections. If a
> server is already at the limit, it goes around the hash ring until it
> finds one that is below the limit, and assigns the connection to that
> server. A large "c" provides stricter hashing (a greater chance of
> requests with the same hash key going to the same server), while a small
> "c" provides stricter balancing (less difference between the smallest and
> largest number of connections to a server). I simulated this algorithm,
> found that it would work better than the ones I tested before, and then
> implemented it in HAProxy. On testing in production I saw reduced
> 95th-percentile response times, and a great reduction in cache-coherency
> traffic.

I'm not surprised at all. The method is interesting.

> I'm attaching the patch that I tested. However, I don't expect that it is
> in any way mergeable. I'm not very familiar with the HAProxy codebase and
> I did a number of things that are almost certainly wrong. I replaced the
> existing chash_get_server_hash instead of attempting to add a new method
> or flag to the configuration. The "c" parameter is hard-coded instead of
> configurable, as haproxy currently has no awareness of non-integer
> numbers, and I wasn't sure how to deal with that. My code for enumerating
> servers and assigning capacities is probably suboptimal, although it
> didn't cause a noticeable bump in CPU usage on my production systems.
> There are probably any number of coding-style gaffes. And there is at
> least one bug that causes haproxy to become unresponsive when servers are
> marked down.

Don't worry, that's the principle of a PoC. I looked at your code. I think
we can significantly simplify it. The current server's relative capacity
should be computed on the fly when trying to pick the server in the ring.
Right now you call this function chash_update_server_capacities() to
recompute all servers' capacities before choosing your server but given that
you already have the information on the number of total served connections
in the backend, it's possible to modify the current lookup function to
avoid running this loop first. This is especially important as some setups
have very large numbers of servers (eg: up to 1000).

Also I've been thinking about this issue of the infinite loop that you
solved already. As long as c > 1 I don't think it can happen at all,
because for any server having a load strictly greater than the average
load, it means there exists at least one server with a load smaller than
or equal to the average. Otherwise it means there's no more server in
the ring because all servers are down, and then the initial lookup will
simply return NULL. Maybe there's an issue with the current lookup
method, we'll have to study this.

Experimental patch: Consistent Hashing with Bounded Loads

2016-09-07 Thread Andrew Rodland
Hello all,

=== Background material, skip this if you just want to know what I've done ===

For some time I've been wishing for a way to combine the best aspects of 
consistent hashing and "balance leastconn". I use haproxy to distribute traffic 
across a cluster of servers doing dynamic video delivery. Without going into 
too much detail, there's a substantial cache effect — there's a bunch of 
information that needs to be calculated for each video, and if that 
information is cached, then the request is quite a bit faster and uses less 
resources. Also, the server cluster runs in the cloud and auto-scales based on 
load.

Initially our approach was to cache the information locally on each server, 
and use uri-based balancing with consistent hashing to distribute the traffic 
among servers according to video ID. This gave a satisfying number of cache 
hits, but had the problem that a single server could become overloaded by 
receiving the traffic for one or two exceptionally popular videos, resulting in 
bad performance.

Our second whack was to change to "balance leastconn" (which destroys the 
cache hit ratio) and add a second layer of shared cache. This gives more 
consistent performance, but the bandwidth used for cache coherency is 
substantial, and the cache servers create a single point of failure.

What I really wanted was a policy that uses consistent hashing most of the 
time, but at the same time, avoids overloading any given server, by diverting 
traffic from an overloaded server to a less-loaded server, in a consistent
order based on the consistent hash ring. I did some experimenting with
algorithms, but didn't find a satisfactory one.

=== End of background material, here's the paper and the patch ===

Recently I came upon the paper "Consistent Hashing with Bounded Loads", 
http://arxiv.org/abs/1608.01350 . It describes a very simple algorithm for 
doing exactly what I wanted. It defines a parameter c > 1, and enforces that no 
server receives more than ceil(c * avg(conns)) connections. If a server is 
already at the limit, it goes around the hash ring until it finds one that is 
below the limit, and assigns the connection to that server. A large "c" 
provides stricter hashing (a greater chance of requests with the same hash key 
going to the same server), while a small "c" provides stricter balancing (less 
difference between the smallest and largest number of connections to a 
server). I simulated this algorithm, found that it would work better than the 
ones I tested before, and then implemented it in HAProxy. On testing in 
production I saw reduced 95th-percentile response times, and a great reduction 
in cache-coherency traffic.
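
To make the algorithm concrete, here is a minimal standalone C sketch of the
bound described above (the data layout and names are illustrative, and the
+1 counting the incoming connection is borrowed from the paper so the limit
is never zero):

    #include <math.h>
    #include <stddef.h>

    /* Illustrative sketch, not HAProxy code. 'ring' lists server ids in
     * clockwise order starting at the slot the request hashed to, and
     * conns[] holds each server's current connection count.
     */
    static int pick_server(const int *ring, size_t ring_len,
                           const int *conns, size_t nsrv, double c)
    {
        size_t i;
        long total = 0;
        double limit;

        if (!nsrv || !ring_len)
            return -1;                      /* nothing to pick from */

        for (i = 0; i < nsrv; i++)
            total += conns[i];

        /* no server may exceed ceil(c * average load); the +1 counts
         * the incoming connection so the limit is at least 1.
         */
        limit = ceil(c * (double)(total + 1) / (double)nsrv);

        /* walk the ring clockwise until a server below the limit is
         * found; with c > 1 and at least one server up, one exists.
         */
        for (i = 0; i < ring_len; i++) {
            if (conns[ring[i]] < limit)
                return ring[i];
        }
        return -1;                          /* nothing eligible */
    }

A larger c keeps the behaviour closer to plain consistent hashing, while a c
close to 1 pushes it toward leastconn-style balancing, exactly as described
above.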

I'm attaching the patch that I tested. However, I don't expect that it is in 
any way mergeable. I'm not very familiar with the HAProxy codebase and I did a 
number of things that are almost certainly wrong. I replaced the existing 
chash_get_server_hash instead of attempting to add a new method or flag to the 
configuration. The "c" parameter is hard-coded instead of configurable, as 
haproxy currently has no awareness of non-integer numbers, and I wasn't sure 
how to deal with that. My code for enumerating servers and assigning 
capacities is probably suboptimal, although it didn't cause a noticeable bump 
in CPU usage on my production systems. There are probably any number of 
coding-style gaffes. And there is at least one bug that causes haproxy to 
become unresponsive when servers are marked down.

That being said, I would be thrilled to work with someone to repair all of 
these problems, if there is interest in having this feature in HAProxy. I 
think that it's an exciting feature and it would be really great to add to 
HAProxy's already excellent set of load-balancing tools. I just need a little 
guidance to make it happen.

My changes are on Github at https://github.com/arodland/haproxy and they're 
up-to-date against haproxy master. I'd encourage anyone to review the code, 
and to contact me by email, on IRC (I'm on #haproxy as hobbs), or on GitHub.

Thank you,

Andrew