Hi Willy.

On 2025-07-24 (Do.) 08:34, Willy Tarreau wrote:
Hi all,

I'm sending here what I posted there:

     https://github.com/orgs/haproxy/discussions/3042

Please use whatever suits you best to respond.

In 3.2, a lot of work was done to reduce the locking cost in the LB
algorithms, making roundrobin and leastconn way more performant than they
were before, particularly on machines with many cores and multiple core
complexes like EPYC CPUs where leastconn used to be really bad in the
past.

However, this work raised a question that's still valid: why do we still
keep roundrobin as the default algorithm, other than for historical reasons?
In many of our tests we use random(2), aka random, because it's much
better: it's not purely random, it also considers the server's load
(it picks the server with the fewest outstanding requests between two random
ones, thus stepping a little bit on leastconn's feet). This is something
that roundrobin cannot consider, by definition.

I've run a few tests to compare roundrobin, leastconn and random. The
tests were run on an AMD EPYC 64-core/128-thread system, with one load
generator attached over 25 Gbps and 4 servers (25 Gbps each). The
tests consisted of retrieving a 1 kB object from the servers.

For each run, the request rate in requests/s, the bandwidth in bits/s, and
the CPU usage of the LB algorithm / locking are reported. The performance
ratio of each algo relative to roundrobin (measured in req/s) is also given.

First test, using 2 thread groups (64 threads each). It can correspond to
systems started with just thread-groups 2 without other NUMA-specific
directives, and may illustrate to some extent how things would behave on
unified-cache CPUs (e.g. Intel), with the limitation that locking costs are
generally lower on such systems:

   algo    roundrobin  leastconn  random
   req/s   910k        1.02M      1.06M
   BW      7.8Gbps     8.6Gbps    9.1Gbps
   LB_CPU  48%         14%        1.04%
   ratio   100%        112%       116%

We can see that in this case the locking of the RR algo was still quite
dominant, resulting in less overall performance.

Second test, enabling cpu-policy performance, which starts 8 thread groups
of 16 threads each:

   algo    roundrobin  leastconn  random
   req/s   1.47M       1.3M       1.37M
   BW      12.3Gbps    11.2Gbps   11.6Gbps
   LB_CPU  0.5%        19%        1.55%
   ratio   100%        88%        93%

Here the LB algo has the lowest locking cost because there are in fact 8
independent round-robin schedulers running.

Third test, in which one of the 4 servers' connections was degraded from 25
Gbps to 1 Gbps:

   algo    roundrobin  leastconn  random
   req/s   430k        1.3M       1.3M
   BW      3.66Gbps    11.1Gbps   11.1Gbps
   LB_CPU  0.1%        19%        1.4%
   ratio   100%        302%       302%

This last test, even though it was rigged by degrading one server's
connection, perfectly illustrates what happens when a server's processing
capacity is not stable over time. It might be taking varying loads, or
dealing with unequal requests, or simply experiencing bursts of garbage
collector activity that slow it down for a few hundred milliseconds every
few seconds. Roundrobin isn't optimal in this case, as it will methodically
keep assigning requests to such a server. Leastconn is optimal in dealing
with such cases, but it's more suited to long-lasting connections, and will
not fairly distribute short-request traffic among servers weighted
differently. For example, if some servers have a slowstart parameter, it
will progressively raise their weight when they're added, but this only
affects their ability to accept new concurrent requests/connections, not
directly their share of the traffic.

And here random excels. By default it makes two draws (the number of
draws is the argument, defaulting to 2) and compares the servers' current
number of connections to pick the least loaded one. This way, if a server
experiences a temporary slowdown, it will keep its current requests longer
and have a higher count than the others, making it less eligible for new
requests under the random algorithm.
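To make the mechanism concrete, here is a small Python sketch of the
two-draw ("power of two choices") selection versus plain roundrobin, with
one degraded server. This is a toy model, not HAProxy code; the server
speeds, the one-request-per-tick queue model, and all names are invented
for illustration only:

```python
import random

def p2c(loads, rng):
    # "Power of two choices": draw two servers at random and send the
    # request to the one with fewer outstanding requests.
    a = rng.randrange(len(loads))
    b = rng.randrange(len(loads))
    return a if loads[a] <= loads[b] else b

def make_rr(n):
    # Plain roundrobin chooser, ignoring server load entirely.
    state = {"i": -1}
    def rr(loads, rng):
        state["i"] = (state["i"] + 1) % n
        return state["i"]
    return rr

def simulate(choose, speeds, ticks=100_000, seed=1):
    # Toy queue model: one request arrives per tick; each tick, server s
    # completes one outstanding request with probability speeds[s].
    rng = random.Random(seed)
    loads = [0] * len(speeds)    # outstanding requests per server
    served = [0] * len(speeds)   # total requests assigned per server
    for _ in range(ticks):
        i = choose(loads, rng)
        loads[i] += 1
        served[i] += 1
        for s, sp in enumerate(speeds):
            if loads[s] and rng.random() < sp:
                loads[s] -= 1
    return served

if __name__ == "__main__":
    speeds = [0.35, 0.35, 0.35, 0.02]   # last server is the degraded one
    print("roundrobin:", simulate(make_rr(4), speeds))
    print("random(2): ", simulate(p2c, speeds))
```

In this model, roundrobin keeps sending the degraded server its full 25%
share while its queue grows without bound, whereas with two draws that
server only receives a request when both draws land on it (about 1/16 of
the time once its queue is the longest), which mirrors the behavior seen
in the third test above.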

Currently, the default load balancing algorithm is roundrobin when none is
specified. The proposal to improve the situation would be, in 3.3, to
simply switch the default to random: if users do not specify the algorithm
they want to use, it means they're fine with the one chosen for them by the
product's designers. Random gives a better distribution on irregular farms
and uses less CPU across varying hardware.

Any objection / support ?

Thank you for the insights.

How about taking a look at https://github.com/haproxy/haproxy/issues/1570 to enhance the LB algorithms?

I personally almost always use leastconn because I like the idea of balancing across all the servers. But my use case is quite simple, and I normally have only 2-3 servers.

Since I always refer to this excellent blog post https://www.haproxy.com/blog/power-of-two-load-balancing, maybe an update of that test with the current HAProxy version could be interesting, in case something has changed?

I'm referring to this block at the end:

```
As explained in the papers cited, Power of Two was designed as a poor man’s Least Connections to be used in situations where a true Least Connections would be impractical, difficult to implement, or come with a significant performance hit. While it’s an excellent workaround, especially as an alternative to Round Robin, whenever you have the choice, you should still prefer Least Connections since it demonstrates better server response times and reduces the cost of server load and processing power.
```

From my point of view, "+1" for random as the new default, but that will confuse some people, because in people's heads "random is bad because it's 'random'", which isn't true in HAProxy's case.

How about creating an alias like a "power-of-two" algo setting, so that people out there get the idea why "random(2)" isn't that bad in HAProxy's case?

Thanks,
Willy

Jm2c

Regards
Aleks

