Hi J,

On Wed, Aug 15, 2018 at 04:56:10PM +0200, J. Kendzorra wrote:
> Hi,
> 
> I was looking into steering HAProxy to some underutilized cores on a multi
> processor system where both network interrupts as well as HAProxy are
> supposed to run on a specific numa node; the initial idea was to have
> interrupts handled on cores 0-21 and give HAProxy cores 44-65:
> # cat /sys/class/net/em3/device/local_cpulist
> 0-21,44-65
> 
> which could result in a cpu-map setting like this:
> cpu-map all 44-65
> 
> However, that won't work since this setting is restricted to be no higher
> than LONGBITS which is defined as ((unsigned int)sizeof(long) * 8). Looking
> into clues to understand this better I found
> https://marc.info/?l=haproxy&m=144252098801440&w=2 which states:
> 
> ,--
> Let's stay on this version for now. We've been facing a lot of issues in
> the past due to over-engineering, so let's wait for users to complain
> that 64 is not enough and only then we'll address that and we'll know
> why we do it.
> `--
> 
> This isn't really an issue and I'm not asking to change this; I'd just like
> to understand better the reason why LONGBITS being 64 is the maximum for
> cpu-map.

It's done like this because we perform all thread mask operations on
a "long" type. This has a lot of benefits, the most important one being
that all these updates and computations can be performed using atomic
operations, which are very fast.
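
Just to illustrate the idea (this is a made-up sketch, not HAProxy's
code, and the names are invented), a thread mask stored in a single
long lets every set/clear be one atomic instruction on one machine
word, which is exactly what caps everything at LONGBITS:

  #include <stdio.h>

  #define LONGBITS ((unsigned int)sizeof(long) * 8)

  /* one bit per thread; the whole mask fits in one machine word */
  static unsigned long thread_mask = 0;

  /* atomically mark thread <tid> as present in the mask */
  static void set_thread_bit(unsigned int tid)
  {
      __atomic_fetch_or(&thread_mask, 1UL << tid, __ATOMIC_SEQ_CST);
  }

  /* atomically remove thread <tid> from the mask */
  static void clear_thread_bit(unsigned int tid)
  {
      __atomic_fetch_and(&thread_mask, ~(1UL << tid), __ATOMIC_SEQ_CST);
  }

  int main(void)
  {
      /* bit 42 assumes 64-bit longs, as on the systems discussed here */
      set_thread_bit(3);
      set_thread_bit(42);
      clear_thread_bit(3);
      printf("mask=%#lx LONGBITS=%u\n", thread_mask, LONGBITS);
      return 0;
  }

As soon as the mask has to be wider than a long, each of these
single-instruction updates turns into multiple words or a lock, which
is the kind of over-engineering we want to avoid until someone proves
they need it.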

> Given that multi processor systems (with multi threading enabled)
> nowadays easily have >64 cores I'd imagine others may run into this as well
> and look for further clues.

Quite frankly, at the moment (in 2018) I see little value in using that many
threads or even processes. Systems' scalability is far from being linear.
Worse, using a NUMA system for anything proxy-related is the best way to
destroy your performance due to the huge latency between the CPU sockets
and the significant impact of cache line flushing for shared data. The only
real reason for using many cores is TLS handshake processing. But even
there you will find a lot of wasted computation power, because the more cores
you have, the lower their frequency and the longer it takes to generate a
single key. And when you start to count in multiples of 50K keys/s, you'll
see that passing everything over shared busses and using a shared TCP session
table over NUMA becomes a huge problem again.

Now it should be technically possible to combine nbproc and nbthread.
The code was designed for this, though it purposely doesn't offer much
flexibility for configuring process+thread affinity. Still, it should
theoretically let you scale up to 4096 threads (64 processes with 64
threads each, which is totally ridiculous).
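
If you want to try it for your case, the idea would be something along
the lines of the untested sketch below: two processes of 11 threads
each, with each process pinned to half of your 44-65 range (this only
binds whole processes; per-thread CPU binding is precisely the
flexibility that isn't offered):

  global
      nbproc 2
      nbthread 11
      cpu-map 1 44-54
      cpu-map 2 55-65

The exact syntax and what ends up bound may vary with the version you
run, so please double-check against the documentation of your release.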

Just to give you some reference numbers, in the past I managed to reach
60 Gbps of forwarded traffic using a 4-core system, and a bit more recently
I reached 520k connections/s using only 8 processes. I'd claim that anyone
running above such numbers should seriously think about spreading the load
over multiple boxes because a single failure starts to cost a lot. So barring
SSL, adding more CPUs will not bring anything really useful for
what we do. Also, when processing clear-text traffic, you'll need roughly as
many CPU cores for the network stack as for haproxy. So even if you had, say,
an 80-core CPU in a single socket that you wanted to use for very high
bandwidth, you'd still configure 40 threads for haproxy and 40 for the
network.

Just my two cents,
willy
