Hi Dmitriy,

On Tue, Jul 26, 2011 at 05:27:10PM +0400, Dmitriy Samsonov wrote:
> > Intel's 10G NICs are fast, but generally hard to tune; you have a hard
> > trade-off between data rate and latency. Basically you tune them for
> > large or small packets but not both. Still for DDoS they might be well
> > suited.
> >
> >
> At least they are well suited for tuning, which is the only thing one may
> want when something is not working.

Indeed !

> > 10G DDoS is something very hard to resist !
> >
> >
> But here in the year 2011, it is not that difficult to run such an attack. :(

I agree and that's a real problem.

> We've tested from the outside. In fact that was a real attack. The botnet
> consisted of ~10-12k bots, each opening 1000 connections/second. There were
> some failures, when Amazon's load balancers responded 'unknown failure', but
> things were getting back to normal in a few minutes. At peak we had 60 nodes
> each running at 50 Mbyte/s,

OK, so they were opening real connections, not just SYN flooding. SYN flooding
generally requires additional system tuning (eg: reduce tcp_synack_retries and
adjust tcp_max_syn_backlog to the proper level to send SYN cookies).
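
For reference, on Linux this usually means something like the following
sysctls; the values below are only an illustration and have to be adapted
to the machine's memory and to the expected traffic :

    # illustrative values only, to be adapted to the workload
    sysctl -w net.ipv4.tcp_syncookies=1
    sysctl -w net.ipv4.tcp_synack_retries=2
    sysctl -w net.ipv4.tcp_max_syn_backlog=65536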

> if you want I can make screenshots from
> Amazon's console with stats and send them to you.

Oh yes please do, that would be appreciated.

> As to the costs... running one hour of such an EC2 node is $0.38 * 60 = $22/h,
> and traffic is $0.13/Gb - this is the most expensive part.

You mean $0.13 per gigabyte I assume. Indeed this is not cheap. But it can be
a good emergency solution when you're experiencing a nasty DDoS that you can't
deal with on your side.

> > My best results were achieved on a Core i5 3.3 GHz (dual core, 4 threads).
> > In short, I bind the network IRQs to core zero and haproxy to core one, and
> > the two remaining threads ensure there is some headroom left for system
> > management, SSH, etc. With this you can run at 100% CPU all day long if you
> > want. I managed to get haproxy to dynamically filter 300k connections per
> > second on such a configuration. That's SYN, SYN/ACK, ACK, RST, based on a
> > source IP's connection rate.
> >
> How many client IPs were you testing? I guess when the number of clients
> reaches some high value there could be problems with the CPU cache because
> the tables are growing?

It's not important because the ACL is matched only once, when the connection
is established. Indeed you're right, the CPU cache is something very important,
but it's already being thrashed by other data structures for each connection.
In practice, looking up an IP address in a 1-million-entry table takes between
100 and 300 ns on average on a recent machine. Since 1 million entries don't
fit in the cache anyway, I think it's already fast enough for our usage.
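
For reference, conn-rate filtering with stick-tables roughly looks like the
sketch below (this is only a sketch: the table size, measurement window and
rejection threshold are arbitrary and must be tuned to the attack) :

    frontend public
        bind :80
        # track each source address and measure its connection rate
        stick-table type ip size 1m expire 10s store conn_rate(10s)
        tcp-request connection track-sc1 src
        # reject (TCP reset) any source exceeding the arbitrary rate below
        acl abuser sc1_conn_rate gt 100
        tcp-request connection reject if abuser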

> > Depending on the type of DDoS you're dealing with, it may be worth
> > distributing the NIC's IRQs to multiple cores so that the kernel's work
> > can scale (eg: emit a SYN/ACK). It might even be worth having multiple
> > haproxy processes, because when you're blocking a DDoS, you don't really
> > care about monitoring, stats, health checks, etc... You only want your
> > machine to be as fast as possible, and doing so is possible with
> > multi-queues. You then need to spread your NIC's IRQs to all cores and
> > bind as many haproxy processes as cores (and manually pin them). The
> > CPU-NIC affinity will not be good on the backend but that's not the issue
> > since you're dealing with a frontend where you want to reject most of
> > your traffic.
> >
> >
> That's true. I've ended up trying to filter attacks on a Dell R510 with
> multiple haproxy processes and irqbalance running.

irqbalance only makes things worse for low-latency processes. When the IRQs
are moved across CPUs only once a second, you can't adapt to such workloads.
It's more suited to heavy tasks which are not very sensitive to latency.
It's much better to set the IRQ masks manually.
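
For example, on the dual-core setup described above, it can be as simple as
the following (the IRQ numbers are machine-specific, check /proc/interrupts
first, the ones below are just placeholders) :

    # bind the NIC's queue IRQs to CPU0 (mask 0x1)...
    echo 1 > /proc/irq/45/smp_affinity
    echo 1 > /proc/irq/46/smp_affinity
    # ...and start haproxy pinned to CPU1 (mask 0x2)
    taskset 0x2 haproxy -f /etc/haproxy/haproxy.cfg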

> And about correct statistics on multiple
> processes - why not use shared memory to store the stats? Would it affect
> performance because of transferring memory blocks between cores?

Yes, and more precisely because of locking, since stats are updated in many
places. What I want to do is what I did in my old traffic generator: have one
memory area per process and have the stats dump interface pick and aggregate
the values from there without locking. If the data are accessed correctly,
this is not a problem. Multi-word data may still need locking though. The main
remaining issue is rate counters and max thresholds, which still need to be
centralized.
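
Just to illustrate the idea, here is a minimal sketch (it is not haproxy code,
and the structure, sizes and field names are made up) : each process only ever
writes to its own slot of a shared mapping, and the dumper sums the slots
without taking any lock :

    /* minimal sketch of lock-free per-process stats aggregation */
    #include <stdio.h>
    #include <sys/mman.h>

    #define NB_PROCS 4

    struct proc_stats {
        unsigned long conns;    /* single-word fields: readable without lock */
        unsigned long bytes_in;
    };

    static struct proc_stats *stats; /* one slot per process, mapped before fork() */

    int stats_init(void)
    {
        stats = mmap(NULL, NB_PROCS * sizeof(*stats), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        return (stats == MAP_FAILED) ? -1 : 0;
    }

    /* called by process <id> on its own slot only: no contention, no lock */
    void count_conn(int id, unsigned long bytes)
    {
        stats[id].conns++;
        stats[id].bytes_in += bytes;
    }

    /* called by the stats dump interface: aggregates all slots lock-free */
    void dump_stats(void)
    {
        unsigned long conns = 0, bytes = 0;
        int i;

        for (i = 0; i < NB_PROCS; i++) {
            conns += stats[i].conns;
            bytes += stats[i].bytes_in;
        }
        printf("conns=%lu bytes_in=%lu\n", conns, bytes);
    }

    /* tiny standalone demo; in reality each forked process would only
     * touch its own slot */
    int main(void)
    {
        if (stats_init() < 0)
            return 1;
        count_conn(0, 512);
        count_conn(1, 1024);
        dump_stats();
        return 0;
    }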

Another approach I was thinking about was to have a dedicated task in each
process responsible for pushing the stats. It would always access valid
stats since it's alone when working, and would lock the central ones. Still,
the issue of rates and maxes would persist.

Since it's already possible to ask each process individually for its stats,
changing this is not something very urgent.

Regards,
Willy

