Hi all.

FOREWORD:

After using MailScanner for about three years, I finally switched to amavisd-new because I liked some things in the implementation better. One thing I really need is a "decent" DSPAM integration. I do not consider the current DSPAM integration in amavisd-new sufficient: it is only a second scanner besides SpamAssassin, uses only one global user, does not let users easily retrain the spam filter, and calculates the end result of the spam check by some hard-coded rules.
Therefore I made some modifications to amavisd-new 2.3.3 in order to get the features I needed. This first patch was very intrusive, removed much functionality and forced me to make some decisions that are fundamental to the way amavisd-new operates. After a bit of thinking I took a step back and am now trying to solve one problem at a time without losing much functionality.

BASIC CONSIDERATIONS:

I identified five main spots that should be improved before one can use DSPAM in amavisd-new as mentioned above:

1. Plugin interface: In order to support different spam scanners without bloating the amavisd-new code too much, I think there should be a defined plugin interface so that adding scanners is easy.

2. Definition of spaminess: Currently the SpamAssassin score is taken as a level of spaminess not just for deciding if something is spam but also for score boost, dsn_cutoff, tag/tag2 level etc. As other spam scanners use other scales, a more general definition of spaminess should be found and the decision making should be left to the individual plugin.

3. Mail body modification: In order to use the full power of DSPAM, DSPAM should be able to write its signature to the body of the mail. This changes many implicit assumptions and raises several new problems (one of them is RAM usage).

4. Mail splitting: A mail that is addressed to multiple recipients may have to be split into several mails if these recipients use different spam profiles, which would lead to different mail bodies when using DSPAM.

5. Environment preparation for spam scanning: I would like to map one recipient to a certain DSPAM user (as the DSPAM groups are not flexible enough for that). IMHO there should be a hook for plugging in an easily modifiable "environment preparation" plugin. I see this point as related to 4.
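
To make points 1 and 2 a bit more concrete, here is a rough sketch of what such a plugin interface could look like. It is written in Python only to keep it short; amavisd-new itself is Perl, and this is not the code from my patch. All names (ScanResult, SpamScanner, KeywordScanner, run_scanners) are made up for illustration, and the keyword check merely stands in for a real scanner.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Scanner-independent result; all fields are hypothetical."""
    scanner: str      # name of the plugin that produced the result
    is_spam: bool     # the plugin itself decides, not the core
    raw_score: float  # scanner-specific value, kept only for logging


class SpamScanner(ABC):
    """Hypothetical plugin interface: every scanner implements scan()."""
    name = "abstract"

    @abstractmethod
    def scan(self, message: str) -> ScanResult:
        ...


class KeywordScanner(SpamScanner):
    """Toy stand-in for a real scanner such as SpamAssassin or DSPAM."""
    name = "keyword"

    def __init__(self, words, threshold=2):
        self.words = [w.lower() for w in words]
        self.threshold = threshold

    def scan(self, message: str) -> ScanResult:
        text = message.lower()
        # Count how many of the configured words occur in the message.
        hits = sum(1 for w in self.words if w in text)
        return ScanResult(self.name, hits >= self.threshold, float(hits))


def run_scanners(scanners, message):
    """Core side: call every registered plugin without knowing its scale."""
    return [s.scan(message) for s in scanners]
```

The point of the sketch is only that the core sees the plugin's decision (is_spam) and never has to interpret a scanner-specific score itself.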

Completing this work would lay the groundwork for chaining multiple spam scanners, with only the problem of combining the results left. Personally, though, I am not convinced that using multiple spam scanners will do any good except for very careful people who want to avoid false positives at all costs.

LICENSE:

My code is clearly a derived work of amavisd-new. Therefore all code written is licensed under the GPL v2 or (at your option) any later version of the GPL.

WHAT I HAVE DONE:

There is now a plugin interface, and the existing SpamAssassin code is used as a plugin. This should be seen as a first step in a larger refactoring to get a fully working DSPAM plugin. Please note that my code has received very little testing.

The patch is available at:
http://www.felix-schwarz.name/files/opensource/04_pluginapi.patch

SOME COMMENTS:

As far as I know there is no other patch which has the features described above. If I did not search thoroughly enough, please point it out to me. I would be happy to improve existing code. I did not ask before writing the code because I wanted to have something to show before making big announcements which may never become reality.

Please consider my modifications as prototype work. I therefore do not expect them to be integrated into mainline amavisd-new now. On the other hand, I do not want to fork amavisd-new when developing a real patch. The purpose of this mail is to publish my work, gather criticism and proposals for improvement, and maybe get other interested developers to work on the code.

My patch certainly does not match the current coding standards of amavisd-new. One (micro) example is naming: I'm using camel case most of the time (Java notation) while I noticed that amavisd-new uses more underscores (C/Un*x style). Additionally I use "my $var = shift();" while I think the current coding style is "my($var) = @_;", and I prefer to use "()" when I call a method although Perl does not require it.
I would adapt my code to the current coding standard, but this way it was easier for me to write.

Other things I feel more strongly about: Currently big methods seem to be preferred (do_check contains more than 900 lines of code) while I think that no method should be longer than 7-10 lines of code (in Perl one mostly needs 10-15 lines because the syntax sometimes needs quite a bit of space to be readable). Instead of relying on the implicit return of the last evaluated value, I prefer an explicit "return 1;" over a bare "1;", for example. Furthermore all the original code is contained in one file, but I don't think this makes sense when introducing a plugin interface. The plugins should be developed as independently as possible and be open to easy in-house customization.

ROADMAP:

1. Reworking the caching code: A cache is only useful for heuristic scanners. The results of statistical filters depend on the complete mail history. Caching should be done in the spam scanning plugins. Move the current cache code to the SpamAssassin plugin.
2. Definition of spaminess (see above)
3. Mail body modification (see above)
4. DSPAM plugin
5. Mail splitting (see above)
6. Environment preparation for spam scanning (see above)

QUESTION:

How many different levels of "spam probability" make sense? SpamAssassin just returns a float value on a linear scale, so one can define arbitrarily many levels. Of course one could do this with DSPAM too (by combining confidence and spam probability into a single value). Currently these different levels are used for identifying spam/ham, tag level, tag2 level, dsn cutoff, quarantine cutoff and kill level. I wonder if many people actually find this many levels useful... How could one define different categories on a solid basis so that they are statistically significant?

I tend to think of only three categories of mail: ham, probably spam and definitely spam. Maybe there is an additional category named "uncertain", but I think such mail should go into ham.
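
As a rough illustration of the three categories, the following sketch maps a linear score and a (probability, confidence) pair onto the same set of verdicts. It is Python only for brevity and not part of the patch. The names Verdict, verdict_from_sa_score and verdict_from_probability as well as all threshold values are assumptions of mine, and how DSPAM really reports probability and confidence is not modeled here.

```python
from enum import Enum


class Verdict(Enum):
    HAM = "ham"
    PROBABLY_SPAM = "probably spam"
    DEFINITELY_SPAM = "definitely spam"


def verdict_from_sa_score(score, spam_level=5.0, sure_level=10.0):
    """Map a linear, SpamAssassin-style score; both levels are assumed values."""
    if score >= sure_level:
        return Verdict.DEFINITELY_SPAM
    if score >= spam_level:
        return Verdict.PROBABLY_SPAM
    return Verdict.HAM


def verdict_from_probability(probability, confidence, sure_confidence=0.9):
    """Map a (probability, confidence) pair from a statistical filter.

    Mail with a spam probability below 0.5 counts as ham, so the
    "uncertain" case falls into ham as suggested above.
    """
    if probability < 0.5:
        return Verdict.HAM
    if confidence >= sure_confidence:
        return Verdict.DEFINITELY_SPAM
    return Verdict.PROBABLY_SPAM
```

With something like this, the scale-specific thresholds stay inside each plugin and the rest of the code only ever sees one of the three verdicts.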
IMHO one needs to answer this question in order to define the plugin API, because I don't think every scanner should adapt to SpamAssassin. I would like to see something like "spam", "definitely spam" etc. as return values which abstract more from the actual SpamAssassin score values.

The five different levels (tag, tag2, ...) suggest dividing the spam into five different categories. If one uses only two or three spam categories, I think the thresholds should be specified in another way, such as:

  spam:            tag
  definitely spam: tag, tag2, quarantine cutoff, dsn cutoff

Comments anyone?

Btw: Is there a plan for the further development of amavisd-new? In which directions should amavisd-new be developed?

Have fun!
fs

_______________________________________________
AMaViS-user mailing list
https://lists.sourceforge.net/lists/listinfo/amavis-user
AMaViS-FAQ: http://www.amavis.org/amavis-faq.php3
AMaViS-HowTos: http://www.amavis.org/howto/
