https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6508
Summary: Speeding up lookups on {trusted,internal,msa}_networks
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Libraries
AssignedTo: [email protected]
ReportedBy: [email protected]
Prompted by the current brokenness in NetAddr::IP 4.034 and 4.035
(Bug 6507) and after Philip Prindeville pointed out that a
Net::Patricia module might be a useful replacement or addition,
I have instrumented the module Mail::SpamAssassin::NetSet with
timing reports, and later added the use of Net::Patricia to it
- with very encouraging (and worrying) results.
Btw, the Net::Patricia module implements radix trie lookups, which are
well suited for fixed size keys (such as 128 bits of a network address)
and does a lookup in constant time O(k), proportional only to a size
of a network address (fixed 128 bits), and independent of a size of
the trie. As such it is commonly used in IP routing, efficiently
finding a closest match from a set of available routes.
See:
http://en.wikipedia.org/wiki/Patricia_trie
Here are some benchmarking results measured on real life mail,
on an AMD quad core Opteron 2376:
Elapsed time PER MESSAGE spent in NetSet lookups (sub contains_ip)
for a set of n networks (using about 15 DNS white-/black lists):
nets current
---- --------
27 22 ms
1000 500 ms
7000 3700 ms
Elapsed time for looking up ONE IP ADDRESS in a set of n networks:
nets current patricia trie
---- -------- -------------
27 1.5 ms 0.07-0.18 ms
1000 34 ms 0.10-0.20 ms
7000 250 ms 0.10-0.20 ms
See also:
Bug 5931 - add_cidr chokes on large amount of trusted_networks
Conclusion 1: for anything above 4 or so networks it pays off
to use Net::Patricia lookups.
Conclusion 2: the Plugin/DNSEval.pm performs one lookup on the
trusted_networks list for each DNSBL response, all of these are
lookups on THE SAME IP address. It clearly calls for some caching
of NetSet lookup results within processing of each message.
Now, there is one caveat with using a Patricia trie for lookups:
it finds the tightest match on listed CIDR networks - unlike our
present sequential search, which returns the first match in the
order of declared networks. For all practical purposes on valid
data this makes no difference. The difference pops up on overlapping
conflicting networks. Some of these are reported by lint, but not
all. For this reason a couple of t/trust_path.t tests currently
fail when Net::Patricia is installed and NetSet decides to use it.
If Net::Patricia is not installed, the NetSet just uses the old
sequential search on a list of NetAddr::IP objects.
Guessing that Hudson does not have the Net::Patricia installed,
I'd just roll in the change to trunk for ease of assessment,
and can back off patricia or change the trust_path.t tests later
on when we decide what would be the best. Would that be alright?
--
Configure bugmail:
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.