[AMaViS-user] Generic spam scanner interface for amavisd-new (patch)

Felix Schwarz Thu, 05 Jan 2006 01:47:25 -0800

Hi all.

FOREWORD:
After using MailScanner for about three years, I finally switched to
amavisd-new because I liked some things in the implementation better.
One thing I really need is a "decent" DSPAM integration. I do not
consider the current DSPAM integration in amavisd-new sufficient: it
is only a second scanner alongside SpamAssassin, uses only one global
user, does not allow easy spam filter retraining by users, and
calculates the end result of the spam check with some hard-coded
rules.


Therefore I made some modifications to amavisd-new 2.3.3 in order to
get the features I needed. This first patch was very intrusive,
removed much functionality, and forced me to make some decisions which
are fundamental to the way amavisd-new operates. After a bit of
thinking I took a step back and am now trying to solve one problem at
a time without losing much functionality.


BASIC CONSIDERATIONS:
I identified five main spots that should be improved before one can
use DSPAM in amavisd-new as mentioned above:
1. Plugin interface: In order to support different spam scanners
   without bloating the amavisd-new code too much, I think there
   should be a defined plugin interface so that adding scanners is
   easy.
2. Definition of spaminess: Currently the SpamAssassin score is taken
   as a level of spaminess not just for deciding if something is spam
   but also for score boost, dsn_cutoff, tag/tag2 level etc. As other
   spam scanners use other scales, a more general definition of
   spaminess should be found and the decision making should be up to
   the single plugin.
3. Mail body modification: In order to use the full power of DSPAM,
   DSPAM should be able to write its signature to the body of the
   mail. This changes many implicit assumptions and raises several
   new problems (one of them is RAM usage).
4. Mail splitting: A mail that is addressed to multiple recipients
   may have to be split into many if these recipients use different
   spam profiles which would lead to different mail bodies when using
   DSPAM.
5. Environment preparation for spam scanning: I would like to map a
   recipient to a certain DSPAM user (as the DSPAM groups are not
   flexible enough for that). IMHO there should be a hook for plugging
   in an easily modifiable "environment preparation" plugin. I see
   this point as related to 4.
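
A minimal sketch of how points 4 and 5 could fit together, written in
Python purely for illustration (amavisd-new itself is Perl). The
mapping table, the function names and the "global" fallback user are
all hypothetical and do not come from the patch:

```python
from collections import defaultdict

# Hypothetical mapping from recipient address to DSPAM user; in a real
# setup this would be supplied by the "environment preparation" hook.
RECIPIENT_TO_USER = {
    "alice@example.com": "alice",
    "sales@example.com": "alice",  # alias served by the same profile
    "bob@example.com": "bob",
}


def dspam_user_for(recipient, default="global"):
    """Environment preparation: pick the spam profile for one recipient."""
    return RECIPIENT_TO_USER.get(recipient.lower(), default)


def split_by_profile(recipients):
    """Mail splitting: group recipients that can share one scanned copy."""
    groups = defaultdict(list)
    for rcpt in recipients:
        groups[dspam_user_for(rcpt)].append(rcpt)
    return dict(groups)
```

Each resulting group would need at most one copy of the mail, so a
message is only split when its recipients really map to different
profiles.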

Completing this work would lay the groundwork for chaining multiple
spam scanners, with only the problem of combining the results left -
although I am personally not convinced that using multiple spam
scanners will do any good except for very careful people who want to
avoid false positives by all means.


LICENSE:
My code is clearly a derived work from amavisd-new. Therefore all code
written is licensed under the GPL v2 or (at your option) any later
version of the GPL.


WHAT I HAVE DONE:
Now there is the plugin interface, and the existing SpamAssassin code
is used as a plugin. This should be seen as a first step in a larger
refactoring to get a fully working DSPAM plugin. Please note that my
code has had very little testing.
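
To make the idea of a scanner plugin interface concrete, here is a
rough sketch in Python (for illustration only; amavisd-new is Perl).
It is not the API from the patch: the class names, the scan() method
and the toy keyword scanner are assumptions chosen to show the shape
of such an interface.

```python
from abc import ABC, abstractmethod


class SpamScannerPlugin(ABC):
    """Hypothetical plugin contract; not the interface from the patch."""

    name = "unnamed"

    @abstractmethod
    def scan(self, message: bytes) -> float:
        """Return a scanner-specific score for the raw message."""


class KeywordScanner(SpamScannerPlugin):
    """Toy stand-in for a real scanner such as SpamAssassin."""

    name = "keyword"

    def scan(self, message: bytes) -> float:
        return 5.0 if b"viagra" in message.lower() else 0.0


class ScannerRegistry:
    """Keeps the core code unaware of which scanners are installed."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin: SpamScannerPlugin) -> None:
        self._plugins[plugin.name] = plugin

    def scan_all(self, message: bytes) -> dict:
        return {name: p.scan(message) for name, p in self._plugins.items()}
```

The core would only talk to the registry, so adding a scanner means
writing one new class and registering it.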

The patch is available at:
http://www.felix-schwarz.name/files/opensource/04_pluginapi.patch


SOME COMMENTS:
As far as I know there is no other patch which has the features
described above. If I did not search thoroughly enough, please point
it out to me. I would be happy to improve existing code. I did not
ask before writing the code because I wanted to have something to
show before making big announcements which may never become reality.

Please consider my modifications prototype work. Therefore I do
not expect them to be integrated into mainline amavisd-new now. On the
other hand I do not want to fork amavisd-new when developing a real
patch. The purpose of this mail is to publish my work, gather
criticism and proposals for improvement, and maybe get other
interested developers to work on the code.


My patch certainly will not match the current coding standards of
amavisd-new.

One (micro) example is naming: I'm using camel case most of the time
(Java notation) while I noticed that amavisd-new uses more underscores
(C/Un*x style). Additionally I use "my $var = shift();" while I
think the current coding style is "my($var) = @_;", and I prefer to
use "()" when I call a method although Perl does not require it. I
would adapt my code to the current coding standard, but this way it
was easier for me to write.

Other things I feel more strongly about: Currently big methods seem
to be preferred (do_check contains more than 900 lines of code) while
I think that no method should be over 7-10 lines of code (for Perl one
mostly needs 10-15 lines because the syntax sometimes needs quite a
bit of space to be readable). Instead of relying on the implicit
return of the last evaluated value, I prefer writing "return 1;"
rather than "1;", for example. Furthermore all the original code is
contained in one file, but I don't think this makes sense when
introducing a plugin interface. The plugins should be developed as
independently as possible and be open to easy in-house customization.


ROADMAP:
1. Reworking the caching code: A cache is only useful for heuristic
scanners. The results of statistical filters depend on the complete
mail history. Caching should be done in spam scanning plugins. Move
the current cache code to the SpamAssassin plugin.
2. Definition of spaminess (see above)
3. Mail body modification (see above)
4. DSPAM plugin
5. Mail splitting (see above)
6. Environment preparation for spam scanning (see above)
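
Roadmap item 1 could look roughly like the following Python sketch
(illustration only; amavisd-new is Perl). It assumes a heuristic
scanner whose result depends only on the message itself, so a digest
of the message is a valid cache key. The class name and the score_fn
callback are hypothetical.

```python
import hashlib


class CachingHeuristicScanner:
    """Hypothetical heuristic scanner that owns its result cache."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # stand-in for the real scan call
        self._cache = {}
        self.calls = 0  # number of real scans, for demonstration

    def scan(self, message: bytes) -> float:
        key = hashlib.sha256(message).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._score_fn(message)
        return self._cache[key]
```

A statistical plugin would simply not keep such a cache, since
retraining can change the result for an identical message.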


QUESTION:
How many different levels of "spam probability" make sense?
SpamAssassin just returns a float value on a linear scale so one can
define arbitrarily many levels. Of course one could do this with DSPAM
too (by combining confidence and spam probability into a single
value). Currently these different levels are used for identifying
spam/ham, tag level, tag2 level, dsn cutoff, quarantine cutoff and
kill level. I wonder if many people actually find this many levels
useful... How could one define different categories on a solid basis
so that they are statistically significant?

I tend to think of only three categories of mail: ham, probably spam
and definitely spam. Maybe there is an additional category named
"uncertain", but such mail should go into ham I think.

IMHO one needs to answer this question in order to define the plugin
API because I don't think every scanner should adapt to SpamAssassin.
I would like to see something like "spam", "definitely spam" etc. as
return values which abstract more from actual SpamAssassin score
values.

The five different levels (tag, tag2, ...) suggest dividing the spam
into five different categories. If one uses only two or three spam
categories, I think the thresholds should be specified in another way,
such as:
 spam: tag
 definitely spam: tag, tag2, quarantine cutoff, dsn cutoff
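
As a sketch of what such categorical return values might look like,
here is a small Python example (illustration only; amavisd-new is
Perl). The action sets are taken from the two lines above; the
Verdict names, the verdict_from_score() adapter and its thresholds
5.0 and 10.0 are made up for the example.

```python
from enum import Enum


class Verdict(Enum):
    HAM = "ham"
    SPAM = "spam"
    DEFINITELY_SPAM = "definitely spam"


# Actions triggered per category, following the example in the text.
ACTIONS = {
    Verdict.HAM: set(),
    Verdict.SPAM: {"tag"},
    Verdict.DEFINITELY_SPAM: {
        "tag", "tag2", "quarantine cutoff", "dsn cutoff",
    },
}


def verdict_from_score(score, spam_at=5.0, definite_at=10.0):
    """Hypothetical adapter for a score-based plugin; thresholds are
    illustrative and would live inside that plugin."""
    if score >= definite_at:
        return Verdict.DEFINITELY_SPAM
    if score >= spam_at:
        return Verdict.SPAM
    return Verdict.HAM


def actions_for(verdict):
    return ACTIONS[verdict]
```

With this shape the core only sees categories; how a plugin arrives
at them (score thresholds, DSPAM confidence, ...) stays its own
business.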

Comments anyone?


Btw: Is there a plan for further development of amavisd-new? In which
directions should amavisd-new be developed?

Have fun!

fs



