Hi Pavlos!

On Wed, Mar 27, 2019 at 09:57:32PM +0100, Pavlos Parissis wrote:
> Have you considered enabling SO_INCOMING_CPU socket option in
> order to increase data locality and CPU cache hits?

No, really for our use case I'm not convinced at all by it, I'm only
seeing risks of making things much worse for the vast majority of use
cases. First, my experience of haproxy+network stack has always shown
that sharing the same CPU for both is the second worst solution (the
worst one being to use different sockets on a NUMA system). Indeed,
haproxy requires quite some CPU and the network stack as well. When
running in keep-alive it's common to see 50/50 user/system. So when
both are on the same CPU, each cycle used by the kernel is not
available to do user stuff and conversely. With processes, I found
that the best model was to have haproxy on half of the cores (plus
their respective siblings if HT was in use) and half of the cores
for the network interrupts. This means that data would flow via the
L3 cache, which usually is pretty fast (typically 10ns access time)
and moves only once per direction. Also, haproxy manipulates a lot
of data and causes lots of L1 cache thrashing, something you
absolutely don't want to experience in the kernel where you need to
have a very fast access to your socket hash tables and various other
structures. Keep in mind that simply memcpy()ing 16 kB from a socket
to a buffer, then sending them to the SSL stack and back to the
kernel destroys 48 kB of cache. My Skylake has 32 kB of L1...
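
To make the arithmetic explicit, here is a rough sketch in Python. It
assumes each of the three hops touches a fresh 16 kB region and ignores
everything else competing for the cache; the step names are only labels
chosen for illustration.

```python
# Rough cache footprint of relaying one 16 kB chunk through an SSL stack.
CHUNK_KB = 16
L1D_KB = 32  # L1 size quoted above for a Skylake core

# Assumption: every hop touches a distinct 16 kB region.
steps = ["socket -> buffer", "buffer -> SSL stack", "SSL stack -> kernel"]
touched_kb = CHUNK_KB * len(steps)

print(f"touched: {touched_kb} kB, L1: {L1D_KB} kB, "
      f"ratio: {touched_kb / L1D_KB:.1f}x")
```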

Placing the network on the first thread of all cores and haproxy on
the second thread of all cores was sometimes better for SSL since
you can have more real cores for crypto code, but then you have the
fact that you still thrash the L1d a lot, and quite a part of the L1i
as well, both of which are shared between the two siblings, with
apparently varying limits that manage to protect each from the other
to some extent. In general when doing this you'd observe a lower
max connection rate but a higher SSL rate.
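
The two placements described so far can be sketched as follows. This is
only an illustration: the `siblings` mapping is a hypothetical input (on
Linux it could be built from
/sys/devices/system/cpu/cpu*/topology/thread_siblings_list), and which
half of the cores goes to haproxy is an arbitrary choice here.

```python
def split_by_core(siblings):
    """Layout 1: haproxy on half of the cores (with their siblings),
    network interrupts on the other half.

    `siblings` maps a physical core id to the sorted list of its
    hardware threads (logical CPU numbers).
    """
    cores = sorted(siblings)
    half = len(cores) // 2
    haproxy = [cpu for core in cores[:half] for cpu in siblings[core]]
    network = [cpu for core in cores[half:] for cpu in siblings[core]]
    return haproxy, network


def split_by_thread(siblings):
    """Layout 2: network on the first thread of every core,
    haproxy on the remaining thread(s)."""
    cores = sorted(siblings)
    network = [siblings[core][0] for core in cores]
    haproxy = [cpu for core in cores for cpu in siblings[core][1:]]
    return haproxy, network


# Example: 4 cores with 2 threads each, where CPU N and N+4 are siblings.
topology = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(split_by_core(topology))    # ([0, 4, 1, 5], [2, 6, 3, 7])
print(split_by_thread(topology))  # ([4, 5, 6, 7], [0, 1, 2, 3])
```

The resulting lists would then be fed to whatever does the actual
pinning (process/thread affinity on one side, IRQ affinity on the
other).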

Nowadays with threads I'm seeing that running haproxy and the system
on siblings of the same core doesn't have as negative an impact as it
used to. One reason certainly is that threads share the same code and
the same data, and that by having all this code and data readily
available in L2, L1 can quickly be refilled (3 ns).

The other thing is that if you are certain to control your network
card's affinity (i.e. no single-flow attack will keep a single CPU
busy), then it's reasonable to let haproxy run on all cores/threads
and the same for the network, because the level of performance you
can reach is so indecently high that the comfort provided by such a
simple and adaptable configuration completely outweighs the losses
caused by bad cache interactions, since you'll hardly ever see the
two compete.

Now to get back to the socket vs CPU affinity, the only case where
it would not have a negative effect is when each thread is bound to
a single CPU and your NIC is perfectly bound with one queue per CPU
as well. But then you get back to the very painful configuration that
can regularly break (we've seen NICs lose their bindings on link
reset for example), and with everything bound by hand you're again
exposed to a single-flow attack that fills one core with a trivial
ACK flood on a single TCP stream.
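
The single-flow problem can be illustrated with a toy model. This is a
sketch under stated assumptions: `pick_cpu` is a hypothetical stand-in
for the NIC's receive hashing (real hardware uses its own hash, not
CRC32), and the addresses are documentation examples. The only property
that matters is that a given 4-tuple always maps to the same CPU.

```python
import zlib
from collections import Counter


def pick_cpu(src_ip, src_port, dst_ip, dst_port, n_cpus):
    """Toy flow-to-CPU mapping: hash the 4-tuple, pick one queue/CPU."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_cpus


N_CPUS = 8

# 1000 ACKs on one TCP stream: every packet carries the same 4-tuple,
# so every packet lands on the same CPU.
load = Counter()
for _ in range(1000):
    load[pick_cpu("198.51.100.7", 40000, "192.0.2.1", 443, N_CPUS)] += 1

# For comparison, 1000 packets from 1000 different source ports.
spread = Counter(
    pick_cpu("198.51.100.7", port, "192.0.2.1", 443, N_CPUS)
    for port in range(40000, 41000)
)

print("single flow:", dict(load))
print("many flows: ", len(spread), "CPUs used")
```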

In all other cases (cpu-map missing, "process" missing on the bind
lines or more than one thread referenced, RPS enabled on the system
to be more resistant to attacks, or simply some intense traffic on
the network side preventing haproxy from making good progress),
you can have lots of bad surprises either by assigning the incoming
connection to the wrong thread (possibly already overloaded), or by
trying to stick the socket to the current CPU which might not be
the one you'd like to use for the network part.

I'm all for experimenting with SO_INCOMING_CPU and figuring out
whether there are cases where it helps (I'm certain it does for packet
capture, like in IDS systems for example) but here I have strong
doubts. If it appears that it does help in certain situations then we
may figure out how we want to make it configurable.
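
For anyone who wants to experiment, a minimal Linux-only sketch of
reading the option is shown below. It is not how haproxy does anything;
the helper names are made up for illustration, and the numeric fallback
49 is the value from asm-generic/socket.h (a few architectures use a
different number), used only when the Python build does not export the
constant.

```python
import os
import socket

# Assumption: fall back to the asm-generic value when the socket module
# does not provide the constant.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def incoming_cpu(sock):
    """Return the CPU the kernel recorded for this socket,
    or None if the option is not supported here."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        return None


def handled_locally(sock):
    """True if the recorded CPU is one the calling thread may run on."""
    cpu = incoming_cpu(sock)
    return cpu is not None and cpu in os.sched_getaffinity(0)
```

Calling incoming_cpu() on accepted connections and comparing the result
with the thread's affinity would be one way to measure how often the
network side and the application side actually end up on the same CPU
before deciding whether steering is worth it.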

Cheers,
Willy
