Hi Willy,

Thanks for the tome-length treatment of my ideas! I'd forgotten how much I enjoy reading them. :)
>> To dig up an old discussion--I took a look at better support for SRV records
>> (using the priority field as backup/non-backup, etc.) a few weeks ago, but
>> determined it didn't make sense in our use case. The issue is 0 weighted
>> servers are considerably less useful to us since they aren't ever used, even
>> in the condition where every other server is down.
>
> I seem to remember a discussion about making this configurable but I
> don't seem to see any commit matching anything like that, so maybe the
> discussion ended up in "change the behavior again the previous one was
> wrong", I don't remember well.

It was quite a long time ago (March), but I didn't have a chance to test the
behavior and look at the code until a few weeks ago.

> With your approach it would be almost identical except that we would
> always have two load-balancing groups, a primary one and a secondary
> one, the first one made only of the active servers and the second one
> made only of the backup servers.

Great! I'm glad it isn't a huge departure from the present code.

> We would then pick from the first
> list and if it's empty, then the next one.

This slightly concerns me; hopefully I'm just not quite understanding the
behavior. Would that imply request A would pick from the primary server group
for all backend requests (including retries) unless the primary is 100% down /
empty?

An ideal path for us (as odd as it may sound) is to allow request A to go to
the primary group first, then optionally redispatch to the secondary group.
This isn't currently possible, and it's the source of most of our remaining
5xx errors.

> We'd just document that the keyword "backup" means "server of the
> secondary group", and probably figure new actions or decisions to
> force to use one group over the other one.

I think if these actions are capable of changing the group picked by retries,
that addresses my concerns.

> I'm dumping all that in case it can help you get a better idea of the
> various mid-term possibilities and what the steps could be (and also what
> not to do if we don't want to shoot ourselves in the foot).

That helps my understanding quite a bit, too!

Regarding queues, LB algorithms, and such: these are of lesser concern for us.
We want to pick backends reasonably fairly, but beyond that we don't much care
(perhaps therein lies the rub).

I was a bit surprised to read that requests are currently queued for a
particular server rather than for a group, which has some interesting
implications for L7 retries on 5xx errors that in turn result in the server
being marked down. It could explain why we're seeing occasional edge cases of
errors that don't make complete sense. (Request D comes in and is scheduled
for a server; that server goes down along with the rest of the group because
requests A, B, and C failed; request D then fails by default, since the group
is now empty.)

A first step towards all of this would be to allow requests to be redispatched
to the backup group. That alone would eliminate many of our issues. We're fine
with a few slower requests if we know they'll likely succeed the second time
around (because the slow region isn't handling both), and it would likely help
our p99 and p999 times a good bit. I was hoping 0-weighted servers would allow
for this, but I was mistaken, since 0-weighted servers are used even less than
backup servers. :-)

I hope this helps clarify our needs.
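To make the desired behavior concrete, here's a rough sketch of the kind of
backend I'm describing (the names and addresses are made up, not our real
config):

    backend tiles
        balance roundrobin
        retries 3
        option redispatch    # re-pick a server when a connection attempt fails
        option allbackups    # spread load across all backup servers, not just the first
        # primary group: the nearby region
        server eu1 10.0.1.10:443 check
        server eu2 10.0.1.11:443 check
        # secondary group: the remote region, only eligible once eu1 AND eu2 are DOWN
        server us1 10.1.1.10:443 check backup
        server us2 10.1.1.11:443 check backup

As I understand it, a retry or redispatch today can only land on us1/us2 once
both eu1 and eu2 have been marked down by the checks; what we're after is a
way for a single request to try the primary group first and, if that attempt
fails, be redispatched to the backup group even while the primary group is
still nominally up.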
Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

> On 8 Jul 2020, at 19:34, Willy Tarreau <[email protected]> wrote:
>
> Hi Luke!
>
> On Wed, Jul 08, 2020 at 11:57:15AM +0200, Luke Seelenbinder wrote:
>> I've been following along the torturous road, and I'm happy to see all the
>> issues resolved and the excellent results.
>
> You can imagine how I am as well :-)
>
>> Personally, I'm excited about the
>> performance gains. I'll deploy this soon on our network.
>
> OK!
>
>> To dig up an old discussion--I took a look at better support for SRV records
>> (using the priority field as backup/non-backup, etc.) a few weeks ago, but
>> determined it didn't make sense in our use case. The issue is 0 weighted
>> servers are considerably less useful to us since they aren't ever used, even
>> in the condition where every other server is down.
>
> I seem to remember a discussion about making this configurable but I
> don't seem to see any commit matching anything like that, so maybe the
> discussion ended up in "change the behavior again the previous one was
> wrong", I don't remember well.
>
>> That raises the next question: is the idea of server groups (with the ability
>> for a request to try group 1, then group 2, etc. on retries) in the
>> development plans at some point? Would that be something I could tinker with
>> as a longer-term project?
>
> That could indeed be an interesting approach because we already almost do
> that between active and backup servers, except that there is always one
> single group at a time. In fact there are 4 possible states for a server
> group:
>
> - populated only with all active servers which are UP or unchecked,
>   provided that there is at least one such server;
>
> - populated only with all backup servers which are UP or unchecked,
>   provided there is at least one such server, that no active server
>   exists in UP or unchecked state, and that option allbackups is set;
>
> - populated with the first UP or unchecked backup server, provided that
>   there is at least one such server, that no active server exists in UP
>   or unchecked state, and that option allbackups is not set;
>
> - no server: all are down.
>
> With your approach it would be almost identical except that we would
> always have two load-balancing groups, a primary one and a secondary
> one, the first one made only of the active servers and the second one
> made only of the backup servers. We would then pick from the first
> list and if it's empty, then the next one.
>
> It shouldn't even consume too much memory since the structures used to
> attach the servers to the group are carried by the servers themselves.
> Only static hash-based algorithms would cause a memory increase on the
> backend, but they're rarely used with many servers due to the high risk
> of rebalancing, so I guess that could be a pretty reasonable change.
>
> We'd just document that the keyword "backup" means "server of the
> secondary group", and probably figure new actions or decisions to
> force to use one group over the other one.
>
> Please note that I'd rather avoid adding too many groups into a farm
> because we don't want to start to scan many of them. If keeping 2 as
> we have today is already sufficient for your use case, I'd rather
> stick to this.
>
> We still need to put a bit more thought into this because I vaguely
> remember an old discussion where someone wanted to use a different
> LB algorithm for the backup servers.
> Here, in terms of implementation,
> it would not be a big deal, we could have one LB algo per group. But
> in terms of configuration (for the user) and configuration storage
> (in the code), it would be a real pain. But possibly it would
> still be worth the price if it starts to allow assembling a backend
> by "merging" several groups (that's a crazy old idea that has been
> floating around for 10+ years and which could possibly make sense in
> the future to address certain use cases).
>
> If you're interested in pursuing these ideas, please, oh please, never
> forget about the queues (those that are used when you set a maxconn
> parameter), because their behavior is tightly coupled with the LB
> algorithms, and the difficulty is to make sure a server which frees
> a connection slot can immediately pick the oldest pending request
> either in its own queue (server already assigned) or the backend's
> (don't care about what server handles the request). This may become
> more difficult when dealing with several groups, hence possibly several
> queues.
>
> My secret agenda would ideally be to one day support shared server groups
> with their own queues between multiple backends so that we don't even
> need to divide the servers' maxconn anymore. But it's still lacking some
> reflection.
>
> I'm dumping all that in case it can help you get a better idea of the
> various mid-term possibilities and what the steps could be (and also what
> not to do if we don't want to shoot ourselves in the foot).
>
> Cheers,
> Willy
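For the archives, here's a minimal illustration of the queue/maxconn coupling
described above (the backend name, addresses, and numbers are all made up):

    backend app
        balance leastconn
        timeout queue 10s    # a queued request gives up with a 503 after 10s
        # each server handles at most 50 concurrent requests; a request already
        # assigned to a server (e.g. by persistence) queues on that server, while
        # an unassigned request waits in the backend queue for whichever server
        # frees a slot first
        server a1 10.0.0.1:8080 check maxconn 50
        server a2 10.0.0.2:8080 check maxconn 50

A server that frees a connection slot has to take the oldest request from
either its own queue or the backend's; that's the coupling that becomes
trickier with several groups and hence potentially several queues.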

