This is a proposal for a MASSIVE optimization for SpamAssassin.  It
would solve the issue in the "Overlapping blacklists" thread, among many
other things.  Unlike my "meta to rule them all" solution, this one
requires nontrivial dev work, including significant expansion of the
priority system and the current shortcircuit mechanism.

I originally posed this idea in
http://old.nabble.com/Properly-integrating-clamAV-into-SpamAssassin-td23360736.html
though it was largely ignored as the thread devolved into an argument
over whether to run anti-spam or anti-virus first.

Scans should be run in the following order:

1. Ham rules (local and network)
2. Bayes
3. Local spam rules
4. Slow spam rules (including all network rules)
5. Afters (e.g. AWL*)

Phases 3 and 4 would be broken down into a few steps, each one sorted by
descending score:

a. first-pass rules (rules that do not depend on other rules)
b. held spam rules that no longer depend on un-run rules
c. repeat previous step as needed
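
A rough sketch of steps a through c for a single phase, in Python
purely for illustration (this is not SpamAssassin code).  The function
name and data layout are made up, and it assumes that any dependency
living outside the phase has already been run:

```python
def order_phase(rules, deps, scores):
    """Split one phase into passes, each sorted by descending score.

    rules  : list of rule names belonging to this phase
    deps   : dict mapping a rule to the set of rules it depends on
    scores : dict mapping a rule to its score
    (All three structures are hypothetical stand-ins.)
    """
    in_phase = set(rules)
    pending = set(rules)
    done = set()
    passes = []
    while pending:
        # A rule is ready once all of its in-phase dependencies have run.
        ready = [r for r in pending
                 if (deps.get(r, set()) & in_phase) <= done]
        if not ready:
            raise ValueError("dependency cycle among %s" % sorted(pending))
        # Highest score first; the name breaks ties deterministically.
        ready.sort(key=lambda r: (-scores.get(r, 0.0), r))
        passes.append(ready)
        done.update(ready)
        pending.difference_update(ready)
    return passes
```

The first list returned is step a, and every later list is one
repetition of step b.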

A spamd setting would define a few thresholds for when to stop, perhaps
aided by 'shortcircuit' directives like the ones we currently have (for
example, for a definitive ham indicator).

Example:

terminate_score 15
body   INEFFICIENT_TEST    /(.{1,20}) (...(this|rule|hurts))+ \1/i
depend INEFFICIENT_TEST    GAPPY_SUBJECT

body   INEFFICIENT_TEST_2  /(.{1,20}) (...(here|we|go|again))+ \1/i
tflags INEFFICIENT_TEST_2  slow

This introduces two new configuration options and a new tflag.  First,
terminate_score says that if we're in phase 3 or 4 and the score
exceeds this value, the engine skips to the afters phase.

Second, depend tells the engine that it should only consider running
INEFFICIENT_TEST if the message in question has triggered GAPPY_SUBJECT.
 By the current ruleqa numbers, this means only 0.1550% of all messages
(0.2579% of spam) would be subjected to this painful painful rule, and
since it's not a first-pass rule, there's a chance the scanning has
already stopped due to hitting the terminate_score.
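
To make the intended behavior concrete, here is a minimal sketch of how
terminate_score and depend could interact inside one spam phase.  It is
Python for illustration only, not SpamAssassin code; the function name,
its arguments, and the idea of passing each rule as a callable are all
assumptions:

```python
def run_spam_phase(rule_order, checks, scores, depends, message,
                   start_score=0.0, terminate_score=15.0):
    """Run one spam phase and return (score, hits, evaluated).

    rule_order : rule names, already ordered so that a rule comes
                 after the rule it depends on
    checks     : dict of rule name -> callable(message) -> bool
    scores     : dict of rule name -> score
    depends    : dict of rule name -> the rule named on its depend line
    """
    score = start_score
    hits = set()
    evaluated = []
    for name in rule_order:
        # terminate_score: stop and let the caller skip to the afters.
        if score > terminate_score:
            break
        # depend: only consider the rule if its prerequisite has hit.
        required = depends.get(name)
        if required is not None and required not in hits:
            continue
        evaluated.append(name)
        if checks[name](message):
            hits.add(name)
            score += scores[name]
    return score, hits, evaluated
```

With depends set to {"INEFFICIENT_TEST": "GAPPY_SUBJECT"}, the
expensive regex is never evaluated for a message that did not hit
GAPPY_SUBJECT, nor for one that is already past the terminate score.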

The "slow" tflag tells the engine that this rule should be held until
the slow rules phase (the "net" tflag does this too, but also triggers a
different scoring set).  Anything that depends (by meta or by "depend"
flag) on a rule that isn't slated to run yet is held until that rule
runs, even if that means sliding to a later phase.
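
The holding rule could be sketched like this, again in Python as an
illustration only.  It covers just phases 3 and 4, and the function
name and the tflags/deps dictionaries are hypothetical; a rule lands in
the latest phase required by its own tflags or by anything it depends
on:

```python
LOCAL_PHASE = 3  # local spam rules
SLOW_PHASE = 4   # slow spam rules, including all network rules


def assign_phases(rules, tflags, deps):
    """Return a dict mapping each rule to the phase it should run in.

    rules  : list of rule names
    tflags : dict of rule name -> set of tflags, e.g. {"slow"}
    deps   : dict of rule name -> rules it depends on, whether through
             a meta expression or a depend line
    """
    phases = {}

    def resolve(rule, stack):
        if rule in phases:
            return phases[rule]
        if rule in stack:
            raise ValueError("dependency cycle at %s" % rule)
        flags = tflags.get(rule, set())
        # "slow" and "net" both push a rule into the slow phase.
        phase = SLOW_PHASE if flags & {"slow", "net"} else LOCAL_PHASE
        # A rule can never run earlier than anything it depends on.
        for dep in deps.get(rule, ()):
            phase = max(phase, resolve(dep, stack | {rule}))
        phases[rule] = phase
        return phase

    for rule in rules:
        resolve(rule, frozenset())
    return phases
```

A local meta that depends on INEFFICIENT_TEST_2 would therefore slide
to phase 4 even though it carries no tflag of its own.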

Note that spam dependencies of ham rules would have to be pulled into
the ham phase in order for the short-circuiting to work.

By ordering the rules in this manner, I've also reduced the network
lookups to the messages that actually require them.  The fact that
really spammy messages hit almost all DNSBLs no longer has to be
rediscovered for every message; a message hitting PBL, PSBL, and SBL
will have 8.854 points from DNSBLs (plus scores from every local rule
hit, hopefully stopping the scanning before this point) by the time it
comes to the SORBS lookups.

There is almost certainly a way of traversing meta rules and
automatically applying depend lines to their less efficient predicates
(as reported by timing.log).  Ideally, this would automate the whole
process and alleviate the need for the "depend" option and the "slow" tflag.

*AWL is a special case that makes this whole thing harder.  The only way
I can see of properly integrating it is to run it each time the
terminate score is reached.  One cop-out could be to calculate the
maximum "reasonable" negative it could apply and modify the terminate
score accordingly, recalculating it upon hitting the slow phase.  An
even simpler cop-out would be to multiply the terminate score by 1.5 if
AWL has data.
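
The two cop-outs amount to something like the following Python sketch.
The function and parameter names are made up for illustration, and how
a "reasonable" maximum AWL adjustment would be estimated is left open:

```python
def effective_terminate_score(terminate_score, awl_has_data,
                              max_awl_negative=None):
    """Raise the terminate score to leave headroom for AWL.

    max_awl_negative is a hypothetical estimate of the largest negative
    adjustment AWL could reasonably apply (e.g. -4.0).  When it is not
    available, fall back to the simpler 1.5x multiplier.
    """
    if not awl_has_data:
        return terminate_score
    if max_awl_negative is not None:
        # The score must clear the threshold even after AWL pulls it down.
        return terminate_score + abs(max_awl_negative)
    return terminate_score * 1.5
```

Under the first cop-out this would be called again on entering the slow
phase.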
