Hi Pavlos!

On Wed, Mar 27, 2019 at 09:57:32PM +0100, Pavlos Parissis wrote:
> Have you considered enabling SO_INCOMING_CPU socket option in
> order to increase data locality and CPU cache hits?

No, really, for our use case I'm not convinced at all by it; I'm only seeing risks of making things much worse for the vast majority of use cases.

First, my experience of haproxy + network stack has always shown that sharing the same CPU for both is the second worst solution (the worst one being to use different sockets on a NUMA system). Indeed, haproxy requires quite some CPU, and so does the network stack. When running in keep-alive it's common to see 50/50 user/system. So when both are on the same CPU, each cycle used by the kernel is not available to do user stuff, and conversely.

With processes, I found that the best model was to have haproxy on half of the cores (plus their respective siblings if HT was in use) and the other half of the cores for the network interrupts. This means that data would flow via the L3 cache, which usually is pretty fast (typically 10 ns access time), and move only once per direction. Also, haproxy manipulates a lot of data and causes lots of L1 cache thrashing, something you absolutely don't want to experience in the kernel, where you need very fast access to your socket hash tables and various other structures. Keep in mind that simply memcpy()ing 16 kB from one socket to a buffer, sending them again to the SSL stack, then back to the kernel destroys 48 kB of cache. My Skylake has 32 kB of L1...

Placing the network on the first thread of all cores and haproxy on the second thread of all cores was sometimes better for SSL since you can have more real cores for crypto code, but then you still thrash the L1d a lot, and quite a part of the L1i as well, both of which are shared between the two siblings, with apparently varying limits that manage to protect each from the other to some extent. In general when doing this you'd observe a lower max connection rate but a higher SSL rate.

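To put numbers on the memcpy() remark above, here is a back-of-the-envelope sketch. It assumes the 48 kB figure comes from three 16 kB copies that each touch a distinct buffer (socket to buffer, buffer to SSL, SSL back to the kernel); the function name is made up for illustration and the real footprint depends on the actual code path.

```python
KB = 1024

def cache_footprint(chunk_bytes, copies):
    """Bytes of cache touched, assuming every copy lands in a distinct buffer."""
    return chunk_bytes * copies

L1D_BYTES = 32 * KB  # the L1 size quoted for the Skylake above

# Assumed steps: socket -> buffer, buffer -> SSL stack, SSL stack -> kernel.
touched = cache_footprint(16 * KB, copies=3)

print(f"{touched // KB} kB touched vs {L1D_BYTES // KB} kB of L1")
print("exceeds L1:", touched > L1D_BYTES)
```

Under these assumptions a single 16 kB record already touches more data than the L1 holds, so whatever the kernel had cached there on the same core is likely evicted.
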
Nowadays with threads I'm seeing that running haproxy and the system on siblings of the same core doesn't have as negative an impact as it used to. One reason certainly is that threads share the same code and the same data, and that by having all this code and data readily available in L2, L1 can quickly be refilled (3 ns). The other thing is that if you are certain to control your network card's affinity (i.e. no single-flow attack will keep a single CPU busy), then it's reasonable to let haproxy run on all cores/threads and the same for the network, because the level of performance you can reach is so indecently high that the comfort provided by such a simple and adaptable configuration completely outweighs the losses caused by bad cache interactions, since you'll hardly ever see the two compete.

Now to get back to the socket vs CPU affinity, the only case where it would not have a negative effect is when each thread is bound to a single CPU and your NIC is perfectly bound with one queue per CPU as well. First, you get back to the very painful configuration that can regularly break (we've seen NICs lose their bindings on link reset for example), and in this case when everything is bound by hand you're back to the risk of a single-flow attack that fills one core from a trivial ACK flood on a single TCP stream.

In all other cases (cpu-map missing, "process" missing on the bind lines or more than one thread referenced, RPS enabled on the system to be more resistant to attacks, or simply some intense traffic on the network side preventing haproxy from making good progress), you can have lots of bad surprises, either by assigning the incoming connection to the wrong thread (possibly already overloaded), or by trying to stick the socket to the current CPU, which might not be the one you'd like to use for the network part.

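To check which CPU the kernel associates with a given socket on a test box, something like the following minimal sketch could be used. It is not haproxy code: the helper name incoming_cpu() is hypothetical, and the fallback value 49 is an assumption taken from Linux's asm-generic socket.h (some architectures use a different number, and non-Linux systems don't have the option at all).

```python
import socket

# Newer Python versions expose socket.SO_INCOMING_CPU on Linux. The fallback
# value 49 is an assumption (asm-generic/socket.h); it may differ per arch.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)

def incoming_cpu(sock):
    """Return the CPU number the kernel reports for this socket,
    or None when the option is not supported on this system."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        return None

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The value is only meaningful once the kernel has processed
        # packets for the socket; on a fresh socket it is just a placeholder.
        print("incoming CPU:", incoming_cpu(s))
    finally:
        s.close()
```

Comparing that value with the CPU the handling thread actually runs on (for instance via os.sched_getaffinity(0) on Linux) shows how often the two match under a given cpu-map and IRQ layout.
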
I'm all for experimenting with SO_INCOMING_CPU and figuring out whether there are cases where it helps (I'm certain it does for packet capture, like in IDS systems for example), but here I have strong doubts. If it appears that it does help in certain situations, then we can figure out how we want to make it configurable.

Cheers,
Willy

