----- Original Message -----
From: "Patrick Schaaf" <[EMAIL PROTECTED]>
To: "Harald Welte" <[EMAIL PROTECTED]>; "Patrick Schaaf" <[EMAIL PROTECTED]>; "Martin Josefsson" <[EMAIL PROTECTED]>; "Aviv Bergman" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, 19 March, 2002 12:16
Subject: Re: [Q] connection tracking scaling
> > I'd rather like to have this information to be gathered at runtime within
> > the kernel, where one could read out the current hash occupation via /proc
> > or some ioctl.
>
> OK, that's what I wanted to hear :-)
>
> Actually, the interesting statistics for a hash are not that large, and all
> aggregate:
>
> - bucket occupation: number of used buckets, vs. number of all buckets
> - average chain length over all buckets
> - average chain length over the used buckets
> - counting of a classification of the chain lengths:
>   - number of 0-entry buckets
>   - number of 1-entry buckets
>   - number of 2-entry buckets
>   - number of 4-entry buckets
>   - number of 8-entry buckets
>   - number of 16-entry buckets
>   - number of more-than-16-entry buckets
>
> That's 10 values, and will at most double when I think more about it.
> I propose to gather these stats on the fly, and simply printk() them
> at a chosen interval:

I'm not a conntrack specialist, nor a kernel hacker, but I have some experience
with IP hash caches in access servers (BRAS) that may be useful(?):

some additional stats:

- HDA: cache hit depth average: the number of iterations through the bucket's
  collision list needed to reach the matching entry.
- MDA: cache miss depth average: the number of iterations walked without
  matching any cache entry (new connection).

HDA is meaningful if you have a bad cache distribution or a small CIS/CTS ratio
(Cache Index Size = number of hash buckets / Cache Total Size = total number of
conntrack tuples cacheable). It also provides good information on traffic type
and cache efficiency. For instance, assume you have real-time traffic (RTP) and
bursty traffic (HTTP/1.1 with keep-alive) at the same time, and that the tuples
for both types of traffic land under the same hash key. If your RT tuple sits
at the end of the collision list, or behind the bursty entries, you will need
frequent extra iterations to reach it... The workaround for that is "collision
promotion": keep a hit counter in each tuple and swap the most frequently
accessed tuple one position ahead (a rough sketch is attached further down).

some questions:

- have you an efficient 'freelist' implementation? What I've seen of
  kmem_cache_free and kmem_cache_alloc doesn't look like a simple pointer
  dereference... Am I wrong? (There is a small freelist sketch below showing
  what I mean.)
- wouldn't it be worth having a "cache promotion" mechanism?

regarding [hashsize=conntrack_max/2], I vote for it! An alternative solution
would be a dynamic hash resize each time the average number of collisions
exceeds a threshold (and no downward resize, except maybe asynchronously). But
in my experience, IP hash distribution is not at all predictable (unless you
know where in the net path your box will sit, and what traffic type (VoIP,
HTTP, eDonkey, ...) it will have to handle, and even then your predictions will
not stay valid for more than 6 months!). Therefore, the common way to handle an
unpredictable distribution is to define [max hash index size >= max number of
cache tuples], with a dynamic hash index resize.... (A sketch of such a
grow-only resize is also attached below.)

One last word: the hash function you're using is the best compromise between
unpredictable IPv4 traffic, cache symmetry, uniformity, and computation time.
I wouldn't change it too much, but there are two possibilities:

- if you keep the modulo method (%), use a prime number far from a power of 2
  for 'ip_conntrack_htable_size'.
- if modulo is too slow, use the bitmasking method (&) with hsize being a power
  of 2 and two extra bit shifts, e.g.
  ((key + (key >> 20) + (key >> 12)) & (hsize - 1)), but this method is not as
  efficient as the modulo method, and would have to be reconsidered for IPv6.
  (A sketch of both variants is attached below.)
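To make the "collision promotion" idea a bit more concrete, here is a rough
sketch in plain C. The names (tuple_entry, key, hits) are made up for
illustration and have nothing to do with the real ip_conntrack structures; the
only point is the depth counting and the one-position swap on a hit:

/* Sketch only: one hash bucket as a simplified singly-linked list. */
#include <stddef.h>

struct tuple_entry {
        struct tuple_entry *next;
        unsigned long key;      /* stands in for the real tuple compare */
        unsigned long hits;     /* per-entry hit counter for promotion  */
};

/*
 * Look up 'key' in one bucket, counting the iterations (the "depth").
 * When the matching entry has been hit more often than the entry in
 * front of it, swap it one position towards the head of the chain.
 */
static struct tuple_entry *
bucket_lookup(struct tuple_entry **head, unsigned long key, unsigned int *depth)
{
        struct tuple_entry **pprev = head;      /* link pointing at 'prev' */
        struct tuple_entry *prev = NULL;
        struct tuple_entry *e = *head;

        *depth = 0;
        while (e != NULL) {
                (*depth)++;
                if (e->key == key) {
                        e->hits++;
                        if (prev != NULL && e->hits > prev->hits) {
                                /* promote: move one step towards the head */
                                *pprev = e;
                                prev->next = e->next;
                                e->next = prev;
                        }
                        return e;
                }
                if (prev != NULL)
                        pprev = &prev->next;
                prev = e;
                e = e->next;
        }
        return NULL;    /* miss: *depth entries walked without a match */
}

Averaging *depth over successful lookups gives the HDA figure mentioned above,
and averaging it over failed lookups gives MDA.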
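For the freelist question, this is all I mean by "a simple pointer
dereference": a trivial LIFO freelist like the sketch below. This is
hypothetical user-space C, not what the slab allocator actually does
internally; that gap is exactly what I'm asking about:

/* Hypothetical LIFO freelist; objects must be at least sizeof(struct free_obj). */
#include <stdlib.h>

struct free_obj {
        struct free_obj *next;
};

static struct free_obj *freelist;

static void *obj_alloc(size_t size)
{
        struct free_obj *obj = freelist;

        if (obj != NULL) {
                freelist = obj->next;   /* pop: one pointer dereference */
                return obj;
        }
        return malloc(size);            /* freelist empty: fall back */
}

static void obj_free(void *p)
{
        struct free_obj *obj = p;

        obj->next = freelist;           /* push: two pointer stores */
        freelist = obj;
}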
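And here is the kind of grow-only resize I have in mind, again as a simplified
user-space sketch with made-up names (htable, hsize, nentries) and with all
locking ignored, which is of course the hard part inside the kernel:

/* Grow-only resize: double the bucket count when the average chain length
 * exceeds a threshold, and rehash every entry into the new table. */
#include <stdlib.h>

struct entry {
        struct entry *next;
        unsigned long key;
};

static struct entry **htable;           /* array of hsize bucket heads */
static unsigned int hsize;              /* current number of buckets   */
static unsigned int nentries;           /* current number of entries   */

#define AVG_CHAIN_THRESHOLD 4

static unsigned int hash_bucket(unsigned long key, unsigned int size)
{
        return (unsigned int)(key % size);
}

static int maybe_grow(void)
{
        struct entry **newtable;
        unsigned int newsize = hsize * 2;
        unsigned int i;

        if (nentries / hsize < AVG_CHAIN_THRESHOLD)
                return 0;               /* below threshold, nothing to do */

        newtable = calloc(newsize, sizeof(*newtable));
        if (newtable == NULL)
                return -1;              /* keep the old table on failure */

        for (i = 0; i < hsize; i++) {
                struct entry *e = htable[i];

                while (e != NULL) {
                        struct entry *next = e->next;
                        unsigned int b = hash_bucket(e->key, newsize);

                        e->next = newtable[b];
                        newtable[b] = e;
                        e = next;
                }
        }

        free(htable);
        htable = newtable;
        hsize = newsize;
        return 1;
}

Called after each insertion, something like this keeps the average chain length
bounded without ever shrinking the table.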
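And the two index computations side by side, as a sketch; the table sizes and
shift amounts are placeholders, not a recommendation for concrete values:

/* Variant 1: modulo with a prime table size (a prime far from a power of 2). */
static unsigned int hash_mod(unsigned long key, unsigned int hsize_prime)
{
        return (unsigned int)(key % hsize_prime);
}

/* Variant 2: fold the key with two shifts and mask with a power-of-2 size;
 * hsize_pow2 must be a power of 2 so that (hsize_pow2 - 1) is a valid mask. */
static unsigned int hash_mask(unsigned long key, unsigned int hsize_pow2)
{
        return (unsigned int)((key + (key >> 20) + (key >> 12)) & (hsize_pow2 - 1));
}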
hope this may help...

> echo 300 >/proc/net/ip_conntrack_showstat
>
> would generate one printk() every 300 seconds. Echoing 0 would disable
> the statistics gathering altogether.
>
> I think I can hack this up, today. Having the flu must be good for something...
>
> later
> Patrick