----- Original Message -----
From: "Patrick Schaaf" <[EMAIL PROTECTED]>
To: "Jean-Michel Hemstedt" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, 19 March, 2002 17:42
Subject: Re: [Q] connection tracking scaling


> Hello Jean-Michel,
>
> thanks for your input. I appreciate it.
>
> On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
> >
> > I'm not a conntrack specialist, neither a kernel hacker, but I've
> > some experience with ip hash caches in access servers (BRAS)
> > that may be useful(?):
> >
> > some additional stats:
> > - HDA: cache hit depth average: the number of iterations in the
> >   bucket's list to get the matching collision entry.
> > - MDA: cache miss depth average: the number of iterations required
> >   without matching a cache entry (new connection).
>
> These two measures make an excellent dynamic measure in addition to
> what I described. However, they require modification of the standard
> LIST_FIND() macro used when traversing the chain, which is a bit
> awkward - and they require keeping properly locked / cpu-local
> counters in the critical path. I wouldn't want this on a highly
> loaded machine, at least not for a longer time.
>
everything is relative: if the overhead of maintaining those local counters
is too big compared to the use we can make of them (collision promotion,
coarser stats, path switching, ...), then just forget about it.
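
Just to make the bookkeeping concrete, here is a rough user-space sketch
(not the real conntrack lookup; the names and counters are invented) of
what accounting the hit/miss depth during a bucket walk could look like;
a real version would need properly locked or per-CPU counters:

#include <stddef.h>

/*
 * Hypothetical sketch: walk one hash bucket and record the depth at which
 * the lookup hits or misses.  Locking is omitted for brevity.
 */
struct tuple_entry {
        struct tuple_entry *next;
        unsigned long key;                      /* stand-in for the tuple */
};

static unsigned long hit_depth_total, hit_count;    /* HDA = total / count */
static unsigned long miss_depth_total, miss_count;  /* MDA = total / count */

static struct tuple_entry *bucket_find(struct tuple_entry *head,
                                       unsigned long key)
{
        unsigned long depth = 0;
        struct tuple_entry *e;

        for (e = head; e != NULL; e = e->next) {
                depth++;
                if (e->key == key) {
                        hit_depth_total += depth;
                        hit_count++;
                        return e;
                }
        }
        miss_depth_total += depth;
        miss_count++;
        return NULL;
}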

> The measures are important because of their dynamical nature. No static
> looking at the chains can capture effects related to the _order_ of the
> lists. These measures can. One possible optimization for the future,
> would be to move the bucket pointer to the "last found" element, giving
> an easy 1-off-cache per bucket. The effect of that cannot be measured
> with a static examination; it should be clearly visible under your
> dynamic measure.
>
> I'll see that I make this "separately optional".
>

of course, each box has its own requirements. A configurable option
in such a generic kernel framework is always better than a built-in
*usable* option.

> > some questions:
> > - have you an efficient 'freelist' implementation? What I've seen about
> >   kmem_cache_free and kmem_cache_alloc doesn't look like a simple
> >   pointer dereference... Am I wrong?
>
> The kmem_cache() stuff is supposed to be the best possible implementation
> of such a freelist on an SMP system. I am not aware that it has
> inefficiencies.
>
and surely me neither ;o)
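
For reference, a rough sketch of how such a slab-backed freelist is
typically used (the names here are made up, and the kmem_cache_create()
signature is the 2.4-era six-argument form; it changed in later kernels):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/netfilter_ipv4/ip_conntrack.h>

static kmem_cache_t *my_conntrack_cachep;

static int my_cache_init(void)
{
        /* name, object size, offset, flags, ctor, dtor */
        my_conntrack_cachep = kmem_cache_create("my_conntrack",
                                                sizeof(struct ip_conntrack),
                                                0, SLAB_HWCACHE_ALIGN,
                                                NULL, NULL);
        return my_conntrack_cachep ? 0 : -ENOMEM;
}

static struct ip_conntrack *my_ct_alloc(void)
{
        /* GFP_ATOMIC: conntrack entries are created in softirq context */
        return kmem_cache_alloc(my_conntrack_cachep, GFP_ATOMIC);
}

static void my_ct_free(struct ip_conntrack *ct)
{
        kmem_cache_free(my_conntrack_cachep, ct);
}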

> > - wouldn't it be worth to have a "cache promotion" mechanism?
>
> What do you mean with that?

(you wiped out my explanation in my previous mail).
It's the mechanism by which the tuples in a collision list are resorted
by hit frequency, trying to reduce the number of iterations required to
get a match. It's only useful if you insert new tuples at the beginning
of the list, or if you have, let's say, a control channel and a data
channel in the same list. In the latter example, if the entry corresponding
to the data channel sits after the entry corresponding to your control
channel, then you waste one iteration per data packet.
example? stream A=1pps, stream B=100pps, lifetime of A = lifetime of B = T
=> if A after B, num iterations = T*(1*100 + 2*1) = T*102
=> if B after A, num iterations = T*(1*1 + 2*100) = T*201
But this mechanism requires a per-tuple hit counter, and some additional
checks and pointer relinking in the fast path... It is thus only useful in
specific cases (i.e. a media gateway, sip/h323 proxy, real-time streaming
env.)
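
To make the idea concrete, here is a small illustrative sketch in plain C
(not netfilter code; all names are invented) of one possible promotion
scheme: each entry carries a hit counter, and on a hit the entry is swapped
one position towards the head of the chain whenever it becomes busier than
its predecessor:

#include <stddef.h>

struct centry {
        struct centry *next;
        unsigned long key;              /* stand-in for the conntrack tuple */
        unsigned long hits;
};

static struct centry *bucket_find_promote(struct centry **headp,
                                          unsigned long key)
{
        struct centry **prevp = headp;  /* link pointing at the predecessor */
        struct centry **linkp = headp;  /* link pointing at the current one */
        struct centry *e;

        for (e = *headp; e != NULL;
             prevp = linkp, linkp = &e->next, e = e->next) {
                if (e->key != key)
                        continue;
                e->hits++;
                /* promote one position if busier than the predecessor */
                if (linkp != headp && e->hits > (*prevp)->hits) {
                        struct centry *pred = *prevp;

                        pred->next = e->next;
                        e->next = pred;
                        *prevp = e;
                }
                return e;
        }
        return NULL;
}

The extra cost per lookup is exactly the hit counter update plus, on a
promotion, three pointer writes, which is the trade-off mentioned above.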

>
> > regarding [hashsize=conntrack_max/2], I vote for!
> > An alternate solution would be to have a dynamic hash resize
> > each time the average number of collisions exceeds a threshold
> > (and no down resize, except maybe asynchronously).
>
> This is REALLY AWFUL in an SMP setting, because you need to keep the
> whole system write locked while you rebuild the hashes. You really
> don't want to do that.

ok, ok, I haven't said anything ;o)... but FYI the "prime double hash" is
supposed to address the problem of smooth hash expansion by gradually
moving the entries from one table to the other (I've not tried it).
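
For what it's worth, here is a generic sketch of the incremental-resize
idea (this is not the actual "prime double hash" scheme, and every name in
it is invented): the old and the new table live side by side and one old
bucket is migrated per insertion, so no single operation ever has to
rehash the whole table:

#include <stddef.h>

struct entry {
        struct entry *next;
        unsigned long key;
};

struct itable {
        struct entry **old_tab, **new_tab;
        size_t old_size, new_size;
        size_t migrated;                 /* old buckets already moved over */
};

static size_t slot(unsigned long key, size_t size)
{
        return key % size;
}

/* Move the entries of one not-yet-migrated old bucket into the new table. */
static void migrate_one_bucket(struct itable *t)
{
        struct entry *e, *next;

        if (t->migrated >= t->old_size)
                return;                  /* migration already finished */
        e = t->old_tab[t->migrated];
        t->old_tab[t->migrated] = NULL;
        for (; e != NULL; e = next) {
                size_t s = slot(e->key, t->new_size);

                next = e->next;
                e->next = t->new_tab[s];
                t->new_tab[s] = e;
        }
        t->migrated++;
}

static void itable_insert(struct itable *t, struct entry *e)
{
        size_t s = slot(e->key, t->new_size);

        migrate_one_bucket(t);           /* amortize the rehash over inserts */
        e->next = t->new_tab[s];
        t->new_tab[s] = e;
}

static struct entry *itable_find(struct itable *t, unsigned long key)
{
        struct entry *e;

        for (e = t->new_tab[slot(key, t->new_size)]; e != NULL; e = e->next)
                if (e->key == key)
                        return e;
        /* an entry may still sit in the old table until its bucket moves */
        if (slot(key, t->old_size) >= t->migrated)
                for (e = t->old_tab[slot(key, t->old_size)]; e != NULL;
                     e = e->next)
                        if (e->key == key)
                                return e;
        return NULL;
}

Under SMP this still needs per-bucket locking during the migration step,
but at least it is never a stop-the-world rebuild of the whole hash.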

>
> > One last word: the hash function you're using is the best compromise
> > between unpredictable ipv4 traffic, cache symmetry, uniformity and
> > computation time.
>
> I'm pretty ignorant wrt theory and attributes of hash functions,
> so I'm damned to check this by experiment. But it's reassuring to
> hear you say the function is basically OK.
>
thanks to the one who chose it (Rusty?). I've got only
one remark: the modulo operation consumes a lot of CPU cycles, especially
if the divisor is prime (i.e. not reducible to a simple mask)! And its
implementation is architecture dependent (so not predictable), whereas
bit-shift operations are totally controllable and give pretty good results
if used correctly... Bit-shifting also has the advantage of letting you
define hashsize=2^n.
If memory is not an issue and CPU cycles are critical ($/Hz >> $/MB),
then we'd better use a bit-shift hash function (personally, this is the
one I used). Anyway, I would be interested in benchmark results on
different platforms... (but again, we are tweaking bits and cycles of one
line while forgetting the big mess outside ;o)
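
To show the two styles side by side, a tiny user-space sketch (illustration
only; the real hash input would be the full conntrack tuple, and both table
sizes below are arbitrary):

#include <stdint.h>

#define HASH_BITS  13
#define HASH_SIZE  (1u << HASH_BITS)    /* power of two: a mask is enough */
#define HASH_PRIME 6151u                /* a prime of comparable size     */

static uint32_t mix(uint32_t saddr, uint32_t daddr, uint16_t ports)
{
        /* stand-in mixing step; the real function sums the tuple fields */
        return saddr + daddr + ports;
}

/* division reduction: any table size works, but it costs a divide */
static uint32_t bucket_mod(uint32_t saddr, uint32_t daddr, uint16_t ports)
{
        return mix(saddr, daddr, ports) % HASH_PRIME;
}

/* shift/mask reduction: needs a power-of-two table, but only an AND */
static uint32_t bucket_mask(uint32_t saddr, uint32_t daddr, uint16_t ports)
{
        return mix(saddr, daddr, ports) & (HASH_SIZE - 1);
}

The mask version is the one that forces hashsize=2^n; that is exactly the
trade-off against the prime-modulus advice discussed below.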

> > I wouldn't change it too much, but there are two
> > propositions possible:
> > - if you keep the modulo method (%), use a prime number far from a
> >   power of 2 for 'ip_conntrack_htable_size'.
>
> Hmm. Can you give a short idea of why "far from a power of 2" is important?
> In a context completely unrelated to iptables, I had good results using
> the dynamically growing array approach you described (this was userlevel
> batch jobs, don't care about latency, only throughput). I precalculated
> the largest prime just below any power of 2, and used that as the array
> size. Works out very well.
>
well, Knuth says so, and I'm quite confident in what he says ;o)
First, this rule applies when the nature of the key is unknown and
its distribution unpredictable (our case). Secondly, the detailed rule
in our case would be "prime far from 2^8". Third, the obscure
mathematical reason behind it is far beyond my competence, but
it seems to come from pattern shifting. In the case of NAT and ipv4,
it seems that pattern repetition most often occurs at byte level.

If anybody else has a better explanation, please help us.
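
As a toy illustration of the byte-level pattern argument (made-up
addresses; it only shows the power-of-two versus prime part, not the
"far from a power of 2" refinement):

#include <stdio.h>

int main(void)
{
        unsigned long key;
        int i;

        /* keys that walk the third octet, i.e. stride 256 */
        printf("key          %% 1024   %% 1021 (prime)\n");
        for (i = 0; i < 5; i++) {
                key = 0x0a000001ul + 256ul * i;
                printf("%#10lx    %4lu      %4lu\n",
                       key, key % 1024, key % 1021);
        }
        return 0;
}

With the power-of-two modulus the keys cycle through only four buckets
(the first and the last line even collide), while the prime modulus
spreads all of them.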


> I agree that, since we already use a full division when calculating
> the hash function, we may as well use a prime hashsize. This will
> waste some room in the last OS page of the array, but that's irrelevant
> given the overall size of the array.
>
yes, but unfortunately modulo, and especially modulo by a prime
(non-power-of-two) divisor, is quite expensive.

> best regards
>   Patrick
>

I understood some time ago why people were able to write books fully
dedicated to hash functions... This is an endless discussion...
but it's amazing to see how one single line of code can kill or boost
your whole code!!!  ;o)


