Am 2022-05-13 08:37, schrieb [email protected]:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Henrik Krohns <[email protected]> ---
Feel free to discuss here or on dev-list if something is still unclear.

First of all, I would like to thank Henrik for all the work he has done on SpamAssassin. The further I get with my code review, the more I see all the places where he has made minor or major changes that have made the code faster, more readable and thus more maintainable.

Despite all these changes, however, I still see room for improvement. At the moment I'm not very happy about how the do_meta_tests subroutine is implemented. It seems to work now, but the query for the dependency of the meta rules looks too complicated to me:

foreach my $r (@{$md->{$rulename}|[]}) {
  next RULE unless exists $h->{$r} && !$tp->{$r} && !$pl{$r};
}

Instead of three dependencies, there should really only be one query at this point for "is the rule ready or not". If we knew exactly the state of a rule at each point in time, then it would also be possible to move from a brute force algorithm to a deterministic algorithm, where a rule that becomes ready automatically sets all meta rules to ready if that rule was the last dependency analogous to the algorithm for tags. do_meta_tests would then simply process a queue of ready meta rules. If a new meta rule is ready it will be added to the end of the queue. If the queue is empty, all rules of a priority class are processed.

The next possible step would be the implementation of short-circuit behavior of && and || analogous to Perl itself. E.g. the rule

meta BITCOIN_SPAM_10 __BITCOIN_ID && ( HTML_IMAGE_ONLY_04 || HTML_IMAGE_ONLY_08 )

would immediately evaluate to false if __BITCOIN_ID is false. HTML_IMAGE_ONLY_04 and HTML_IMAGE_ONLY_08 would then no longer need to be evaluated.

Recently the question was asked why Check.pm is a plugin if it is not optional. Check.pm is a plugin so that you can implement more than one check plugin. Currently SpamAssassin still assumes that filtering happens postqueue, where you can respond to the different wishes of individual users. If you use SpamAssassin in a prequeue filtering, then only the decision rejection or acceptance is possible for ALL recipients. A consideration of different recipient wishes is not possible.

Due to the possibility to switch between admin and user rules per evaluation of an email, one accepts that the evaluation of the rules must be rebuilt for each email. What we need instead is the possibility to build the rules once at the start of the daemon and then use it to evaluate any number of emails.

You can go one step further and first run a SpamAssassin instance with a lightweight ruleset in prequeue mode to be able to decide as quickly as possible whether to reject an email and then run another instance with a heavyweight ruleset that makes a more precise distinction between marking the email as spam or ham. The second instance can then also evaluate user rules. There should be a feedback loop between the instances so that the first instance can use the results of the second instance, e.g. to quickly reject emails via a local blocklist using the HashBL.pm plugin. Currently we use a feedback loop between the SpamAssassin instance in prequeue mode and Postfix to (temporarily) reject emails for 24 hours.

In any case, SpamAssassin should arrive in 2022 and offer a highly optimized version of the analysis for MTAs that process millions of emails per day.

These are my general remarks about the evaluation of rules. An email with some minor cosmetic changes will follow.

Michael

Reply via email to