Hi Davide,

It seems I missed your response last time.

On Fri, Feb 18, 2022 at 04:47:29PM +0100, Davide Pucci wrote:
> > I'm having a number of questions about the motivations behind this
> > design, because I'm seeing a certain number of fundamental issues this
> > will cause and am not sure about the expected benefits nor about the
> > fact that such benefits couldn't be addressed differently. A few points:
> >
> > - I initially thought that working this way could result in strongly
> >   asymmetric loads, due to the duality between servers, but in practice
> >   no server is another one's peer; it's only for a given node that two
> >   servers are paired, so we can expect that when a server goes down,
> >   statistically all the other ones will take over its load, not just one.
> 
> In cases where HAProxy is adopted to spread load across several CDN servers,
> it becomes crucial to minimize the variety of servers coupled with a given
> request, because the backend servers actively cache data/assets based on
> the requests they usually receive. So, yes, in theory any backend server
> could pick up the request and handle it, but in practice, to maximize
> system performance, forcing the proxy to choose fixed couples (or a single
> backend, if one of them is down) is preferred.

Sure, but I mean, that's nothing new at this point; it's already what
those dealing with large contents are doing with consistent hashing
and hash-balance-factor, which makes your content available on one
to N nodes depending on the excess load only.
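
For reference, a minimal sketch of such a setup (the backend name,
server names and addresses are just placeholders, and the factor is
only an example to adjust to how much excess load you tolerate on a
single node):

    backend cdn
        balance uri
        hash-type consistent
        hash-balance-factor 150
        server cache1 192.0.2.11:80 check
        server cache2 192.0.2.12:80 check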

> > - from my analysis you completely removed the support for weights, both
> >   static and dynamic. This is clearly not possible. Static weights are
> >   critically important as many users combine servers of different
> >   generations or capacities in the same farm, and even dynamic ones are
> >   in high demand nowadays (it was one of the motives for consistent hashing).
> 
> Yup, we ignore weights during the creation of the (b)tree as all the servers
> that compose the tree are supposed to weigh the same and, more specifically,
> because, in the event of a backend weight change, the tree should not get
> unbalanced. Such a rebalancing would change which backend serves a specific
> asset, and you can imagine the impact of that if what we're proxying is a
> huge CDN/cache.

Well, I really disagree on this. We know that a number of users (especially
with hashes) occasionally adjust their weights to better spread the load,
and in addition lowering a weight is particularly useful at certain
moments. Typically you replace a dead disk in your cache, it restarts
cold, and you absolutely do not want to send 100% of the traffic to it,
so either you configure a slowstart (which requires a dynamic weight), or
you adjust the weight manually from time to time to let the disks fill
with contents.
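
As an illustration, a minimal sketch of the kind of server line this
relies on (names, addresses and the ramp-up time are just placeholders):

    backend cdn
        balance uri
        hash-type consistent
        # let the server ramp up over 10 minutes when it comes back,
        # instead of immediately receiving its full share of traffic
        server cache1 192.0.2.11:80 weight 100 slowstart 10m check

Both slowstart and manual adjustments (e.g. "set weight" on the stats
socket) only work when the load-balancing algorithm honours dynamic
weights.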

> We could've considered keeping the weight during the tree creation,
> but only if the changes we're talking about would "upset" the tree
> distribution within a certain threshold.
> This is the reason why all the variability that does not strictly
> depend on the hashing algorithm has been excluded, so as to maximise
> the chances that the same backends are always picked to serve the
> same requests, such that those requests receive "cached" responses.

But again I don't see how that differs from what is already available.

> > - the fixed number of LB nodes will result in ~600kB of extra RAM per
> >   server, which starts to count a lot for large farms. This is a direct
> >   consequence of the removal of weight support above.
> 
> This is not entirely clear to me.
> What do you mean by "LB nodes"? The HAProxy instances?

No, the tree nodes that are stored for each server to attach many
occurrences of it in the tree.

> And what about "per server"? Per backend?
> How did you get the ~600kB?

I don't have all the details in mind anymore, but a quick check shows
that you're representing each server 11587 times in the tree, and each
occurrence takes 48 bytes (struct tree_occ), hence a total of 556kB
per server. That's a lot!
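To put it in perspective, a farm of just 100 such servers would already
spend about 100 * 556kB = ~55MB on these tree occurrences alone.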

> Would it change anything to use a fixed backend weight instead?

It's already fixed by your CHASH_NUM_NODES.

> > - also the fact that down servers (or drained ones) are not removed from
> >   the tree means that the algorithm will cost a huge amount of CPU cycles
> >   evaluating many nodes when some servers are in this state. This is a
> >   particular concern in multi-threading that we've already had to
> >   partially work around with some algos and queuing where a thread holds
> >   the lock for a very long time while others are waiting.
> 
> This is indeed a concern, but still a temporary critical scenario: in our
> case what matters most is the load on the backends that are serving the
> resources, rather than the load on the HAProxy(s). That's why we prefer to
> keep those (down) entries in the tree: to prevent the HAProxy(s) from
> picking different (tree-rebalanced) backend servers.

Maybe, but we know from experience that these situations are extremely
problematic. Just look for "abnormal CPU usage" or similar entries in
the issue tracker, and each time you'll find someone having to deal
with saturated servers that are still present and cause new attempts
at picking another node, thus increasing the lookup time and contention
between threads. We've spent a great deal of time working around such
issues by trying to limit the probability that they appear, so it's hard
to accept an approach that will cause the same trouble by design.

> > I think I can understand why you would want to have pairs of servers
> > sharing the same hash key, e.g. to pre-populate a cache with objects on
> > another node. But I'm really curious about what particular use case is
> > not properly covered by the bounded-load hash algorithm that Andrew
> > Rodland implemented a while back, which was clearly made to make sure
> > that objects were better shared between nodes, and not just 2 nodes,
> > but as many as the excess load requires.
> >
> > Maybe based on such use cases we could see how to refine your solution
> > and/or extend what already exists to cover your use case.
> 
> Obviously, it'd be best to switch our configuration to something
> already supported to achieve the very same behaviour.

But have you tried to use "hash-balance-factor" at all? Andrew, who
designed it, certainly has great experience with content delivery and
could show impressive "before vs after" graphs proving the effectiveness
of the mechanism, which comes with none of the significant compromises
that your current proposal imposes.
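
Roughly speaking, with "hash-balance-factor 150" no server is allowed
to hold more than 1.5 times the average load of the backend; a request
that would push the selected server beyond that is simply handed to the
next server on the consistent-hash ring. So an object stays on one node
most of the time and only spreads to additional nodes while the excess
load requires it.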

> Or maybe it could be preferable to just change the name of the proposed
> algorithm so that it better matches its purpose.
> Maybe something more scope-related, e.g. "cache-consistent",
> or more behaviour-related, e.g. "static-consistent".

Again, it's not so much a matter of naming as of all the particular
cases it implies internally. Ignoring weights, causing CPU contention
on down servers and using lots of RAM would have to come with outstanding
returns on investment, and for now I don't even see the solution as being
as good as what is already available, since it's limited to two servers
and will not be able to automatically adapt to varying loads on certain
objects. In the end I mostly see several undesirable side effects and no
real gains over what already exists.

Regards,
Willy
