This is a proposal for a MASSIVE optimization for SpamAssassin. It would solve the issue in the "Overlapping blacklists" thread, among many other things. Unlike my "meta to rule them all" solution, this one requires nontrivial dev work, including significant expansion of the priority system and the current shortcircuit mechanism.
I originally posed this idea in http://old.nabble.com/Properly-integrating-clamAV-into-SpamAssassin-td23360736.html though it was largely ignored as the thread devolved into a matter of which order to run anti-spam vs anti-virus.

Scans should be run in the following order:

1. Ham rules (local and network)
2. Bayes
3. Local spam rules
4. Slow spam rules (including all network rules)
5. Afters (e.g. AWL*)

Phases 3 and 4 would be broken down into a few steps, each one sorted by descending score:

a. first-pass rules (rules that do not depend on other rules)
b. held spam rules that no longer depend on un-run rules
c. repeat previous step as needed

A spamd setting will note a few thresholds for when to stop, perhaps aided by 'shortcircuit' directives like what we currently have (like for a definitive ham indicator).

Example:

    terminate_score 15

    body INEFFICIENT_TEST /(.{1,20}) (...(this|rule|hurts))+ \1/i
    depend INEFFICIENT_TEST GAPPY_SUBJECT

    body INEFFICIENT_TEST_2 /(.{1,20}) (...(here|we|go|again))+ \1/i
    tflags INEFFICIENT_TEST_2 slow

This introduces two new configuration options and a new tflag.

First, terminate_score, which says that if we're on phase 3 or 4 and the score exceeds this value, skip to the afters phase.

Second, depend tells the engine that it should only consider running INEFFICIENT_TEST if the message in question has triggered GAPPY_SUBJECT. By the current ruleqa numbers, this means only 0.1550% of all messages (0.2579% of spam) would be subjected to this painful painful rule, and since it's not a first-pass rule, there's a chance the scanning has already stopped due to hitting the terminate_score.

The "slow" tflag tells the engine that this rule should be held until the slow rules phase (the "net" tflag does this too, but also triggers a different scoring set).

Anything that depends (by meta or by "depend" flag) on a rule that isn't slated to run yet is held until that rule runs, even if that means sliding to a later phase.
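To make the ordering concrete, here is a rough sketch of the scan loop described above, in Python rather than SpamAssassin's Perl. It is only an illustration under assumptions, not how the engine works today: the Rule class, the scan function, the BIG_LOCAL rule, and every score below are hypothetical, the rule tests are trivial stand-ins for the real regexes, AWL is modelled as a fixed -1.0 rule, and every dependency is treated with "depend" semantics (the dependent rule runs only if the other rule hit).

```python
from dataclasses import dataclass
from typing import Callable, Tuple

PHASES = ["ham", "bayes", "local", "slow", "after"]
TERMINABLE = {"local", "slow"}  # phases 3 and 4 in the proposal


@dataclass
class Rule:
    name: str
    score: float
    phase: str
    test: Callable[[str], bool]
    depends: Tuple[str, ...] = ()  # rules that must have hit first


def scan(message, rules, terminate_score):
    pending = list(rules)
    done, hits, ran = set(), set(), []
    score = 0.0
    for index, phase in enumerate(PHASES):
        while True:
            # A rule is ready once its own phase has been reached and every
            # rule it depends on has been resolved; otherwise it stays held,
            # possibly sliding into a later phase.
            ready = [r for r in pending
                     if PHASES.index(r.phase) <= index
                     and all(d in done for d in r.depends)]
            if not ready:
                break
            if phase in TERMINABLE:
                ready.sort(key=lambda r: r.score, reverse=True)
            for rule in ready:
                if phase in TERMINABLE and score > terminate_score:
                    # Skip to the afters: drop every other pending rule.
                    pending = [r for r in pending if r.phase == "after"]
                    break
                pending.remove(rule)
                done.add(rule.name)
                # "depend": only run if every prerequisite actually hit.
                if all(d in hits for d in rule.depends):
                    ran.append(rule.name)
                    if rule.test(message):
                        hits.add(rule.name)
                        score += rule.score
    return score, ran, hits


# Hypothetical rule set; tests and scores are placeholders.
RULES = [
    Rule("INEFFICIENT_TEST", 3.0, "local", lambda m: True,
         depends=("GAPPY_SUBJECT",)),
    Rule("INEFFICIENT_TEST_2", 1.0, "slow", lambda m: False),
    Rule("GAPPY_SUBJECT", 2.0, "local", lambda m: "g a p" in m),
    Rule("BIG_LOCAL", 20.0, "local", lambda m: "viagra" in m),
    Rule("AWL", -1.0, "after", lambda m: True),
]

if __name__ == "__main__":
    print(scan("viagra", RULES, terminate_score=15))
    print(scan("g a p p y", RULES, terminate_score=15))
```

With these placeholder rules, the "viagra" message crosses the threshold on the first local rule, so only BIG_LOCAL and AWL ever run. The gappy message is the only one that pays for INEFFICIENT_TEST, and a message that misses GAPPY_SUBJECT never runs it at all.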
Note that spam dependencies of ham rules would have to be pulled into the ham phase in order for the short-circuiting to work.

By ordering the rules in this manner, I've also reduced the network lookups to the messages that actually require them. Really spammy messages hit almost all DNSBLs, but we no longer need to actually discover that for each one; a message hitting PBL, PSBL, and SBL will have 8.854 points from DNSBLs (plus scores from every local rule hit, hopefully stopping the scanning before this point) by the time it comes to the SORBS lookups.

There is almost certainly a way of traversing meta rules and automatically applying depend lines to their less efficient predicates (as reported by timing.log). Ideally, this would automate the whole process and alleviate the need for the "depend" option and the "slow" tflag.

*AWL is a special case that makes this whole thing harder. The only way I can see of properly integrating it is to run it each time the terminate score is reached. One cop-out could be to calculate the maximum "reasonable" negative adjustment it could apply and modify the terminate score accordingly, recalculating it upon hitting the slow phase. An even simpler cop-out would be to multiply the terminate score by 1.5 if AWL has data.
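The two AWL cop-outs could look something like the sketch below. The function name and its parameters are made up for illustration, and reading "modify the terminate score accordingly" as "raise it by the size of the largest reasonable negative AWL adjustment" is my assumption; how that maximum would be estimated is left open.

```python
def effective_terminate_score(terminate_score, awl_has_data,
                              max_awl_negative=None):
    """Return the threshold to use while AWL has not yet been applied.

    max_awl_negative is a hypothetical estimate of the largest
    "reasonable" negative adjustment AWL could make (e.g. -4.0).
    """
    if not awl_has_data:
        # No AWL history for this sender: nothing to compensate for.
        return terminate_score
    if max_awl_negative is not None:
        # First cop-out: leave room for the worst-case AWL correction.
        return terminate_score + abs(max_awl_negative)
    # Simpler cop-out: flat 1.5x multiplier whenever AWL has data.
    return terminate_score * 1.5


if __name__ == "__main__":
    print(effective_terminate_score(15, awl_has_data=False))
    print(effective_terminate_score(15, awl_has_data=True))
    print(effective_terminate_score(15, True, max_awl_negative=-4.0))
```

For the first cop-out, the function would be called again with a fresh estimate when the scan enters the slow phase.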
