Hoi Willy!

What a detailed and fruitful response, as always!

On 27/3/19 10:56 PM, Willy Tarreau wrote:
> Hi Pavlos!
> 
> On Wed, Mar 27, 2019 at 09:57:32PM +0100, Pavlos Parissis wrote:
>> Have you considered enabling SO_INCOMING_CPU socket option in
>> order to increase data locality and CPU cache hits?
> 
> No, really for our use case I'm not convinced at all by it, I'm only
> seeing risks of making things much worse for the vast majority of use
> cases. First, my experience of haproxy+network stack has always shown
> that sharing the same CPU for both is the second worst solution (the
> worst one being to use different sockets on a NUMA system). Indeed,
> haproxy requires quite some CPU and the network stack as well. When
> running in keep-alive it's common to see 50/50 user/system. So when
> both are on the same CPU, each cycle used by the kernel is not
> available to do user stuff and conversely. With processes, I found
> that the best model was to have haproxy on half of the cores (plus
> their respective siblings if HT was in use) and half of the cores
> for the network interrupts. This means that data would flow via the
> L3 cache, which usually is pretty fast (typically 10ns access time)
> and moves only once per direction. Also, haproxy manipulates a lot
> of data and causes lots of L1 cache thrashing, something you
> absolutely don't want to experience in the kernel where you need to
> have a very fast access to your socket hash tables and various other
> structures. Keep in mind that simply memcpy() 16 kB from one socket
> to a buffer and, sending them again to the SSL stack then back to the
> kernel destroys 48kB of cache. My skylake has 32kB of L1... 
> 

OK, I have to admit that I didn't think about thrashing L1 cache. My bad.

> Placing the network on the first thread of all cores and haproxy on
> the second thread of all cores was sometimes better for SSL since
> you can have more real cores for crypto code, but then you have the
> fact that you still trash the L1d a lot, and quite a part of the L1i
> as well, both of which are shared between the two siblings, with
> apparently varying limits that manage to protect each from the
> other to some extent. In general when doing this you'd observe a
> lower max connection rate but a higher SSL rate.
> 
> Nowadays with threads I'm seeing that running haproxy and the system
> on siblings of the same core doesn't have as negative an impact as it
> used to. One reason certainly is that threads share the same code and
> the same data, and that by having all this code and data readily
> available in L2, L1 can quickly be refilled (3 ns).
> 
> The other thing is that if you are certain to control your network
> card's affinity (i.e. no single-flow attack will keep a single CPU
> busy), then it's reasonable to let haproxy run on all cores/threads
> and the same for the network because the level of performance you
> can reach is so indecently high that the comfort provided by such a
> simple and adaptable configuration completely outweighs the losses
> caused by bad cache interactions since you'll hardly ever need to
> see the two compete.

This is the setup we ended up with, but with HT disabled and with the first
two CPUs reserved for side-cars (rsyslog etc.), so haproxy uses 12 out of
14 CPUs while the IRQ queues are handled by the same CPUs. During peak
traffic, CPU utilization at the SoftIRQ level never goes above 6%, and only
during stress testing did we notice usage close to 40%. As you rightly
wrote, simplicity matters, and we opted for simplicity.
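For reference, a layout like ours can be expressed with cpu-map; a minimal sketch in haproxy 1.9 syntax, assuming the box's CPUs are numbered 0-13 (the exact numbers here are hypothetical):

```
global
    nbthread 12
    # CPUs 0-1 stay free for the side-cars;
    # threads 1-12 are pinned one-to-one onto CPUs 2-13
    cpu-map auto:1/1-12 2-13
```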

> 
> Now to get back to the socket vs CPU affinity, the only case where
> it would not have a negative effect is when each thread is bound to
> a single CPU and your NIC is perfectly bound with one queue per CPU
> as well. 

But on different CPUs, otherwise we are back to the situation of thrashing
the CPU cache.

> First, you get back to the very painful configuration that
> can regularly break (we've seen NICs lose their bindings on link
> reset for example), and in this case when everything is bound by
> hand you're back to the risk of a single-flow attack that fills one
> core from a trivial ACK flood on a single TCP stream.
> 

The above is a general problem, and irqbalance could solve it, but irqbalance
has almost always proven to make the wrong decision. I believe the majority of
people have it disabled and use either the affinity scripts from their NIC
vendor or custom code in a configuration-management system (Puppet etc.).

> In all other cases, cpu-map missing, "process" missing on the bind
> lines or more than one thread referenced, RPS enabled on the system
> to be more resistant to attacks, or simply some intense traffic on
> the network side preventing haproxy from making good progress, well,
> you can have lots of bad surprises either by assigning the incoming
> connection to the wrong thread (possibly already overloaded), or by
> trying to stick the socket to the current CPU which might not be
> the one you'd like to use for the network part.
> 

Fair enough.

> I'm all for experimenting with SO_INCOMING_CPU and figure if there
> exist cases where it helps (I'm certain it does for packet capture
> like in IDS systems for example) but here I have strong doubts. If
> it appears that it does help for certain situations then we may
> figure how we want to make it configurable.
> 

To sum up this very detailed response, full of vital observations: if the
application does quite a bit of memory copying, the benefits of the CPU
caches are lost to thrashing, and SO_INCOMING_CPU could then have a
negative impact.

Thanks a lot Willy for your response, I very much appreciate the time you
took to reply.

Pavlos
