Hi Davide,

sorry for the long delay, but your initial message arrived the day before the release, and just after the second one we got derailed by a number of pending issues, so it got lost in the noise. Anyway.
On Wed, Dec 01, 2021 at 04:49:53PM +0100, Davide Pucci wrote:
> From d856de9813854a482a1dd52b73adf55859610fcb Mon Sep 17 00:00:00 2001
> From: Andrea Fagiani <[email protected]>
> Date: Wed, 1 Dec 2021 16:25:08 +0100
> Subject: [PATCH] hash: add support for consistent (hashing) 2x algorithm
> 
> This algorithm, unlike the traditional and already supported one,
> first identifies the closest server to the request's hash (which
> is called the primary) and, then, also the next one (the secondary).
> 
> Next, they get evaluated as below:
> * if both servers are usable, the algorithm picks one up randomly,
>   taking their weight into account;
> * if only one of them is usable, then it's the one to be chosen;
> * if both are unusable, then the algorithm keeps looking for
>   another one.
> 
> If a server goes down, it stays in the tree, it does not get removed
> from the cluster. In such case, the algorithm will take care to guide
> the decision toward its primary/secondary server, depending on the
> server status.
> 
> The algorithm is deterministic on the primary/secondary servers couple,
> whilst nondeterministic over which of the two will be picked up to
> serve the request. This leads to a balance among the designated servers
> which will serve the request.

I have a number of questions about the motivations behind this design,
because I can see a certain number of fundamental issues it will cause,
and I'm not sure about the expected benefits, nor that such benefits
couldn't be achieved differently. A few points:

- I initially thought that working this way could result in strongly
  asymmetric loads, due to the duality between servers, but in practice
  no server is another one's peer; it's only for a given node that two
  servers are paired. So we can estimate that when a server goes down,
  statistically all the other ones will take over its load, and not
  just one.

- from my analysis, you completely removed the support for weights,
  both static and dynamic. This is clearly not possible. Static weights
  are critically important, as many users combine servers of different
  generations or capacities in the same farm, and even dynamic weights
  are in high demand nowadays (they were one of the motivations for
  consistent hashing).

- the fixed number of LB nodes will result in ~600kB of extra RAM per
  server, which starts to count a lot for large farms. This is a direct
  consequence of the removal of weight support above.

- also, the fact that down (or drained) servers are not removed from
  the tree means that the algorithm will burn a huge amount of CPU
  cycles evaluating many nodes while some servers are in this state.
  This is a particular concern with multi-threading, which we've
  already had to partially work around for some algorithms and for
  queuing, where a thread holds the lock for a very long time while the
  others are waiting.

I think I can understand why you would want pairs of servers sharing
the same hash key, e.g. to pre-populate a cache with objects on another
node. But I'm really curious about what particular use case is not
properly covered by the bounded-load hash algorithm that Andrew Rodland
implemented a while back, which was clearly designed to make sure that
objects were better shared between nodes, and not just 2 nodes but as
many as needed depending on the excess load. Maybe based on such use
cases we could see how to refine your solution and/or extend what
already exists to cover your use case.

Thanks!
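By the way, for illustration, the bounded-load idea essentially adds an eligibility test on top of the regular consistent hash lookup: the closest server is skipped while it already carries more than its fair share of the farm's load. A rough standalone sketch of that test follows; the function name, parameters and the c_num/c_den fraction are mine for the example, not the actual chash_server_is_eligible() code:

```c
#include <assert.h>

/* Sketch of a bounded-load eligibility test: a server may only take a
 * new request while its load stays strictly below c times the farm's
 * average load, rounded up. c is expressed as the fraction c_num/c_den
 * (e.g. 4/3). Illustrative only, not HAProxy's actual code. */
static int bounded_load_eligible(unsigned served, unsigned total_served,
                                 unsigned nb_servers,
                                 unsigned c_num, unsigned c_den)
{
    /* slots = ceil(c_num * (total_served + 1) / (c_den * nb_servers)) */
    unsigned slots = (c_num * (total_served + 1) + c_den * nb_servers - 1)
                     / (c_den * nb_servers);
    return served < slots;
}
```

A server failing this test is skipped and the lookup moves on to the next node on the ring, so any overflow spreads over as many servers as needed rather than over one fixed peer.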
Willy

> ---
>  include/haproxy/backend-t.h |   3 +-
>  include/haproxy/lb_chash.h  |   1 +
>  src/cfgparse-listen.c       |   3 +
>  src/cfgparse.c              |   5 ++
>  src/lb_chash.c              | 129 ++++++++++++++++++++++++++++++++++--
>  5 files changed, 135 insertions(+), 6 deletions(-)
> 
> diff --git a/include/haproxy/backend-t.h b/include/haproxy/backend-t.h
> index 126528400..73cddebe4 100644
> --- a/include/haproxy/backend-t.h
> +++ b/include/haproxy/backend-t.h
> @@ -109,7 +109,8 @@
>  /* hash types */
>  #define BE_LB_HASH_MAP    0x000000 /* map-based hash (default) */
>  #define BE_LB_HASH_CONS   0x100000 /* consistent hashbit to indicate a dynamic algorithm */
> -#define BE_LB_HASH_TYPE   0x100000 /* get/clear hash types */
> +#define BE_LB_HASH_CONS2X 0x200000 /* consistent hashbit to indicate a dynamic algorithm */
> +#define BE_LB_HASH_TYPE   0x300000 /* get/clear hash types */
> 
>  /* additional modifier on top of the hash function (only avalanche right now) */
>  #define BE_LB_HMOD_AVAL   0x200000 /* avalanche modifier */
> diff --git a/include/haproxy/lb_chash.h b/include/haproxy/lb_chash.h
> index 79504574b..0aa72793a 100644
> --- a/include/haproxy/lb_chash.h
> +++ b/include/haproxy/lb_chash.h
> @@ -30,6 +30,7 @@ struct server;
>  int chash_init_server_tree(struct proxy *p);
>  struct server *chash_get_next_server(struct proxy *p, struct server *srvtoavoid);
>  struct server *chash_get_server_hash(struct proxy *p, unsigned int hash, const struct server *avoid);
> +struct server *chash_get_server_hash_c2x(struct proxy *p, unsigned int hash);
> 
>  #endif /* _HAPROXY_LB_CHASH_H */
> 
> diff --git a/src/cfgparse-listen.c b/src/cfgparse-listen.c
> index 5deec5e6b..ef95fe5dc 100644
> --- a/src/cfgparse-listen.c
> +++ b/src/cfgparse-listen.c
> @@ -2673,6 +2673,9 @@ int cfg_parse_listen(const char *file, int linenum, char **args, int kwm)
>  		if (strcmp(args[1], "consistent") == 0) {	/* use consistent hashing */
>  			curproxy->lbprm.algo |= BE_LB_HASH_CONS;
>  		}
> +		else if (strcmp(args[1], "consistent-2x") == 0) {	/* use consistent-2x hashing */
> +			curproxy->lbprm.algo |= BE_LB_HASH_CONS2X;
> +		}
>  		else if (strcmp(args[1], "map-based") == 0) {	/* use map-based hashing */
>  			curproxy->lbprm.algo |= BE_LB_HASH_MAP;
>  		}
> diff --git a/src/cfgparse.c b/src/cfgparse.c
> index 58067c8ed..3b5d4f956 100644
> --- a/src/cfgparse.c
> +++ b/src/cfgparse.c
> @@ -3474,6 +3474,11 @@ int check_config_validity()
>  			if (chash_init_server_tree(curproxy) < 0) {
>  				cfgerr++;
>  			}
> +		} else if ((curproxy->lbprm.algo & BE_LB_HASH_TYPE) == BE_LB_HASH_CONS2X) {
> +			curproxy->lbprm.algo |= BE_LB_LKUP_CHTREE | BE_LB_PROP_DYN;
> +			if (chash_init_server_tree(curproxy) < 0) {
> +				cfgerr++;
> +			}
>  		} else {
>  			curproxy->lbprm.algo |= BE_LB_LKUP_MAP;
>  			init_server_map(curproxy);
> diff --git a/src/lb_chash.c b/src/lb_chash.c
> index 023219c98..d204b1fcc 100644
> --- a/src/lb_chash.c
> +++ b/src/lb_chash.c
> @@ -24,6 +24,10 @@
>  #include <haproxy/server-t.h>
>  #include <haproxy/tools.h>
> 
> +#define CHASH_NUM_NODES 11587
> +#define CHASH_BASE_NUM 1000
> +#define CHASH_SRV_BASE(srv) ((int)(srv->puid / CHASH_BASE_NUM))
> +
>  /* Return next tree node after <node> which must still be in the tree, or be
>   * NULL. Lookup wraps around the end to the beginning. If the next node is the
>   * same node, return NULL. This is designed to find a valid next node before
> @@ -155,7 +159,9 @@ static void chash_set_server_status_down(struct server *srv)
>  		p->srv_act--;
>  	}
> 
> -	chash_dequeue_srv(srv);
> +	if ((p->lbprm.algo & BE_LB_HASH_TYPE) != BE_LB_HASH_CONS2X) {
> +		chash_dequeue_srv(srv);
> +	}
> 
>   out_update_backend:
>  	/* check/update tot_used, tot_weight */
> @@ -217,7 +223,9 @@ static void chash_set_server_status_up(struct server *srv)
>  	}
> 
>  	/* note that eweight cannot be 0 here */
> -	chash_queue_dequeue_srv(srv);
> +	if ((p->lbprm.algo & BE_LB_HASH_TYPE) != BE_LB_HASH_CONS2X) {
> +		chash_queue_dequeue_srv(srv);
> +	}
> 
>   out_update_backend:
>  	/* check/update tot_used, tot_weight */
> @@ -268,7 +276,9 @@ static void chash_update_server_weight(struct server *srv)
>  	HA_RWLOCK_WRLOCK(LBPRM_LOCK, &p->lbprm.lock);
> 
>  	/* only adjust the server's presence in the tree */
> -	chash_queue_dequeue_srv(srv);
> +	if ((p->lbprm.algo & BE_LB_HASH_TYPE) != BE_LB_HASH_CONS2X) {
> +		chash_queue_dequeue_srv(srv);
> +	}
> 
>  	if (srv->flags & SRV_F_BACKUP)
>  		p->lbprm.tot_wbck += srv->next_eweight - srv->cur_eweight;
> @@ -309,6 +319,107 @@ int chash_server_is_eligible(struct server *s)
>  	return s->served < slots;
>  }
> 
> +/*
> + * This function returns the running server from the CHASH (2x) tree.
> + * Differently from the traditional chash_get_server_hash() function,
> + * this first picks two servers <primary> and <secondary>,
> + * which are at the closest distance from the value of <hash>.
> + * If both are usable, it randomly chooses between them, taking their
> + * weights into account, and returns it.
> + * If one of them is unusable, it returns the latter.
> + * If both are unusable, then it looks for the next usable server in the
> + * tree.
> + * Keep in mind that when a server goes down, it does not get excluded
> + * from the cluster, staying in the tree: the algorithm itself will
> + * guide its decision towards its primary/secondary.
> + */
> +struct server *chash_get_server_hash_c2x(struct proxy *p, unsigned int hash)
> +{
> +	struct eb32_node *start, *cur;
> +	struct server *srv = NULL, *primary = NULL, *secondary = NULL, *first_usable = NULL;
> +	struct eb_root *root;
> +
> +	if (p->srv_act)
> +		root = &p->lbprm.chash.act;
> +	else if (p->lbprm.fbck)
> +		return p->lbprm.fbck;
> +	else if (p->srv_bck)
> +		root = &p->lbprm.chash.bck;
> +	else
> +		return NULL;
> +
> +	/* find the node after and the node before */
> +	cur = eb32_lookup_ge(root, hash);
> +	if (!cur)
> +		cur = eb32_first(root);
> +	if (!cur)
> +		return NULL; /* tree is empty */
> +
> +	start = cur;
> +	primary = eb32_entry(cur, struct tree_occ, node)->server;
> +	int base_id = CHASH_SRV_BASE(primary);
> +
> +	// find the secondary replica
> +	cur = eb32_next(start);
> +	if (!cur) {
> +		cur = eb32_first(root);
> +	}
> +
> +	while (cur != start) {
> +		srv = eb32_entry(cur, struct tree_occ, node)->server;
> +		if (CHASH_SRV_BASE(srv) != base_id) {
> +			secondary = srv;
> +			break;
> +		}
> +
> +		cur = eb32_next(cur);
> +		if (!cur) {
> +			cur = eb32_first(root);
> +		}
> +	}
> +
> +	int primary_usable = srv_currently_usable(primary);
> +	int secondary_usable = secondary ? srv_currently_usable(secondary) : 0;
> +
> +	if (primary_usable && secondary_usable) {
> +		uint64_t rsum = primary->cur_eweight + secondary->cur_eweight;
> +		// copied from sample.c:smp_fetch_rand()
> +		uint64_t r = ((uint64_t)random() * rsum) / ((uint64_t)RAND_MAX+1);
> +		return r < primary->cur_eweight ? primary : secondary;
> +	} else if (primary_usable) {
> +		return primary;
> +	} else if (secondary_usable) {
> +		return secondary;
> +	} else {
> +		// find a usable server, possibly with different base, regardless of the weight
> +		cur = eb32_next(start);
> +		if (!cur) {
> +			cur = eb32_first(root);
> +		}
> +
> +		while (cur != start) {
> +			srv = eb32_entry(cur, struct tree_occ, node)->server;
> +			if (srv_currently_usable(srv)) {
> +				if (CHASH_SRV_BASE(srv) != base_id) {
> +					return srv;
> +				} else if (!first_usable) {
> +					first_usable = srv;
> +				}
> +			}
> +
> +			cur = eb32_next(cur);
> +			if (!cur) {
> +				cur = eb32_first(root);
> +			}
> +		}
> +	}
> +
> +	// at this point we couldn't find any replica with different base
> +	// just return the first usable server, if any
> +
> +	return first_usable;
> +}
> +
>  /*
>   * This function returns the running server from the CHASH tree, which is at
>   * the closest distance from the value of <hash>. Doing so ensures that even
> @@ -326,6 +437,10 @@ struct server *chash_get_server_hash(struct proxy *p, unsigned int hash, const s
>  	struct eb_root *root;
>  	unsigned int dn, dp;
>  	int loop;
> +	if ((p->lbprm.algo & BE_LB_HASH_TYPE) == BE_LB_HASH_CONS2X) {
> +		nsrv = chash_get_server_hash_c2x(p, hash);
> +		return nsrv;
> +	}
> 
>  	HA_RWLOCK_RDLOCK(LBPRM_LOCK, &p->lbprm.lock);
> 
> @@ -497,7 +612,11 @@ int chash_init_server_tree(struct proxy *p)
>  	/* queue active and backup servers in two distinct groups */
>  	for (srv = p->srv; srv; srv = srv->next) {
>  		srv->lb_tree = (srv->flags & SRV_F_BACKUP) ? &p->lbprm.chash.bck : &p->lbprm.chash.act;
> -		srv->lb_nodes_tot = srv->uweight * BE_WEIGHT_SCALE;
> +		if ((p->lbprm.algo & BE_LB_HASH_TYPE) == BE_LB_HASH_CONS2X) {
> +			srv->lb_nodes_tot = CHASH_NUM_NODES;
> +		} else {
> +			srv->lb_nodes_tot = srv->uweight * BE_WEIGHT_SCALE;
> +		}
>  		srv->lb_nodes_now = 0;
>  		srv->lb_nodes = calloc(srv->lb_nodes_tot, sizeof(*srv->lb_nodes));
> @@ -510,7 +629,7 @@
>  		srv->lb_nodes[node].node.key = full_hash(srv->puid * SRV_EWGHT_RANGE + node);
>  	}
> 
> -	if (srv_currently_usable(srv))
> +	if (((p->lbprm.algo & BE_LB_HASH_TYPE) == BE_LB_HASH_CONS2X) || srv_currently_usable(srv))
>  		chash_queue_dequeue_srv(srv);
>  	}
>  	return 0;
> -- 
> 2.30.2
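For readers following along, the weighted choice between the two candidates in chash_get_server_hash_c2x() above boils down to the following, with the RNG factored out so the decision itself stays deterministic and testable. The function names are mine for this sketch, not the patch's:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Deterministic core of the two-candidate weighted pick: given the two
 * effective weights wp and ws, and a value r uniformly drawn in
 * [0, wp + ws), the primary wins when r falls below its own weight. */
static int picks_primary(uint64_t wp, uint64_t ws, uint64_t r)
{
    (void)ws; /* only needed by the caller to bound r */
    return r < wp;
}

/* Draw r in [0, rsum) without modulo bias, the same way the patch does
 * (it borrows the formula from sample.c:smp_fetch_rand()). */
static uint64_t draw_r(uint64_t rsum)
{
    return ((uint64_t)random() * rsum) / ((uint64_t)RAND_MAX + 1);
}
```

With wp=3 and ws=1, the primary is returned for r in {0, 1, 2} and the secondary only for r=3, i.e. the primary serves 3/4 of such requests on average.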

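Also, for completeness: if I read the cfgparse-listen.c hunk correctly, the new mode would be enabled with a configuration like the one below. Note that CHASH_SRV_BASE() (puid / 1000) suggests server ids are expected to encode the replica group in their thousands digit, which would deserve documentation. This is an untested sketch with invented backend/server names and addresses:

```
backend app
    balance uri
    hash-type consistent-2x
    # ids chosen so that CHASH_SRV_BASE() puts each server in its own group
    server s1 192.0.2.11:80 id 1001 check
    server s2 192.0.2.12:80 id 2001 check
```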
