Hoi Willy! What a detailed and fruitful response, as always.
On 27/3/19 10:56 μ.μ., Willy Tarreau wrote:
> Hi Pavlos!
>
> On Wed, Mar 27, 2019 at 09:57:32PM +0100, Pavlos Parissis wrote:
>> Have you considered enabling the SO_INCOMING_CPU socket option in
>> order to increase data locality and CPU cache hits?
>
> No, really for our use case I'm not convinced at all by it, I'm only
> seeing risks of making things much worse for the vast majority of use
> cases. First, my experience of haproxy+network stack has always shown
> that sharing the same CPU for both is the second worst solution (the
> worst one being to use different sockets on a NUMA system). Indeed,
> haproxy requires quite some CPU and the network stack as well. When
> running in keep-alive it's common to see 50/50 user/system. So when
> both are on the same CPU, each cycle used by the kernel is not
> available to do user stuff, and conversely. With processes, I found
> that the best model was to have haproxy on half of the cores (plus
> their respective siblings if HT was in use) and half of the cores
> for the network interrupts. This means that data would flow via the
> L3 cache, which usually is pretty fast (typically 10 ns access time)
> and moves only once per direction. Also, haproxy manipulates a lot
> of data and causes lots of L1 cache thrashing, something you
> absolutely don't want to experience in the kernel, where you need
> very fast access to your socket hash tables and various other
> structures. Keep in mind that simply doing a memcpy() of 16 kB from
> one socket to a buffer, then sending it to the SSL stack and back to
> the kernel, destroys 48 kB of cache. My Skylake has 32 kB of L1...

OK, I have to admit that I didn't think about thrashing the L1 cache. My bad.
> Placing the network on the first thread of all cores and haproxy on
> the second thread of all cores was sometimes better for SSL, since
> you can have more real cores for crypto code, but then you have the
> fact that you still trash the L1d a lot, and quite a part of the L1i
> as well, both of which are shared between the two siblings, with
> apparently varying limits that manage to protect each from the other
> to some extent. In general when doing this you'd observe a lower
> max connection rate but a higher SSL rate.
>
> Nowadays with threads I'm seeing that running haproxy and the system
> on siblings of the same core doesn't have as negative an impact as it
> used to. One reason certainly is that threads share the same code and
> the same data, and that by having all this code and data readily
> available in L2, L1 can quickly be refilled (3 ns).
>
> The other thing is that if you are certain to control your network
> card's affinity (i.e. no single-flow attack will keep a single CPU
> busy), then it's reasonable to let haproxy run on all cores/threads
> and the same for the network, because the level of performance you
> can reach is so indecently high that the comfort provided by such a
> simple and adaptable configuration completely outweighs the losses
> caused by bad cache interactions, since you'll hardly ever need to
> see the two compete.

This is the setup we ended up with, but with HT disabled and with the
first two CPUs dedicated to side-cars (rsyslog, etc.), so haproxy uses
12 out of 14 CPUs while the NIC's IRQ queues are spread across the CPUs.
During peak traffic, softirq CPU utilization never goes above 6%, and
only during stress testing did we notice usage close to 40%. As you
rightly wrote, simplicity matters, and we opted for simplicity.
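For context, a setup like the one I describe above can be expressed with haproxy's cpu-map directive; a hedged sketch (the CPU numbers 2-13 are my assumption for "first two CPUs reserved for side-cars", not taken from our actual config):

```
global
    nbthread 12
    # Bind threads 1-12 of process 1 to CPUs 2-13, one thread per CPU,
    # leaving CPUs 0-1 free for side-car daemons such as rsyslog.
    cpu-map auto:1/1-12 2-13
```

The `auto:` prefix makes haproxy pair each thread in the set with the corresponding CPU in the list, rather than binding every thread to the whole mask.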
> Now to get back to the socket vs CPU affinity, the only case where
> it would not have a negative effect is when each thread is bound to
> a single CPU and your NIC is perfectly bound with one queue per CPU
> as well.

But only with threads and NIC queues on different CPUs; otherwise we go
back to the CPU-cache-thrashing situation.

> First, you get back to the very painful configuration that
> can regularly break (we've seen NICs lose their bindings on link
> reset for example), and in this case when everything is bound by
> hand you're back to the risk of a single-flow attack that fills one
> core from a trivial ACK flood on a single TCP stream.

The above is a general problem that irqbalance could solve, but
irqbalance has proven to take the wrong decision almost every time. I
believe the majority of people have it disabled and use either the
affinity scripts shipped by the NIC vendor or custom code in a
configuration-management system (Puppet, etc.).

> In all other cases, cpu-map missing, "process" missing on the bind
> lines or more than one thread referenced, RPS enabled on the system
> to be more resistant to attacks, or simply some intense traffic on
> the network side preventing haproxy from making good progress, well,
> you can have lots of bad surprises, either by assigning the incoming
> connection to the wrong thread (possibly already overloaded), or by
> trying to stick the socket to the current CPU, which might not be
> the one you'd like to use for the network part.

Fair enough.

> I'm all for experimenting with SO_INCOMING_CPU and figuring out if
> there exist cases where it helps (I'm certain it does for packet
> capture like in IDS systems, for example), but here I have strong
> doubts. If it appears that it does help in certain situations, then
> we may figure out how we want to make it configurable.

To sum up this very detailed reply, full of vital observations and
information:
If the application does quite a bit of memory copying, the benefit of
the CPU caches is lost to the thrashing that takes place, and
SO_INCOMING_CPU could then have a negative impact.

Thanks a lot, Willy, for your response. I very much appreciate the time
you took to reply.

Pavlos

