Re: [AMaViS-user] Generic spam scanner interface for amavisd-new (patch)

Mark Martinec Tue, 10 Jan 2006 11:59:09 -0800

Felix,

> One thing I really need is a "decent" DSPAM integration. I do not
> consider the current DSPAM integration in amavisd-new as sufficient as
> it is currently only a second scanner besides SpamAssassin, uses only
> one global user, does not enable easy spam filter retraining by users
> and the end result of the spam check is calculated by some hard coded
> rules.
> Therefore I did some modifications to amavisd-new 2.3.3 in order to
> get the features I needed. This first patch was very intrusive,
> removed much functionality and I had to make some decisions which are
> fundamental to the way amavisd-new is operating. After a bit of
> thinking I took a step back, now trying to solve one problem at a time
> without losing much functionality.


I have incorporated your first-step split of spam-scanning code into
general and SA-specific, so this should make your further efforts
hopefully less intrusive. (I am sending the code snapshot in private mail.)

> BASIC CONSIDERATIONS:
> I identified five main spots that should be improved before one can
> use DSPAM in amavisd-new as mentioned above:
> 1. Plugin interface: In order to support different spam scanners
>    without bloating the amavisd-new code too much, I think there
>    should be a defined plugin interface so that adding scanners is easy.

Loading of SA-specific code is now optional, it is a separate module,
but in step with the current practice it still resides in the same file.

> 2. Definition of spaminess: Currently the SpamAssassin score is taken
>    as a level of spaminess not just for deciding if something is spam
>    but also for score boost, dsn_cutoff, tag/tag2 level etc. As other
>    spam scanners use other scales, a more general definition of
>    spaminess should be found and the decision making should be up to
>    the single plugin.

Perhaps. Often machine-learning (AI) programs try to simplify
the representation of a complex world by representing it with
discrete attributes (like low/med/high). Eventually someone clever
enough finds a way to effectively deal with continuous values again,
often with better results at not much higher computational cost.
This may not be a direct comparison, but I do find it quite
flexible to deal with a linear scale as used by SA. Splitting it
into (say) three regions does not offer enough control, and using
say 7 discrete regions is worse than having a numerical value.

This does not mean that end users need to deal with SA values directly.
An administrative interface can easily map user preferences such as
probable/likely/certain spam into internal values according to
the administrator's settings.
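
A minimal sketch of such a mapping follows. The three labels are the
ones named above; the numeric levels, the default and the subroutine
name are made up for illustration and are not amavisd-new settings.

```perl
use strict;
use warnings;

# Hypothetical site-wide table chosen by the administrator:
# user-visible label => internal kill level on the SA-style linear scale.
my %kill_level_for = (
  probable => 5.0,
  likely   => 8.0,
  certain  => 12.0,
);

# Translate a user's preference into a numeric kill level,
# falling back to a site default for unknown or missing labels.
sub kill_level_from_preference {
  my($label, $default) = @_;
  $default = 6.9  if !defined($default);   # assumed site default
  (defined($label) && exists($kill_level_for{$label}))
    ? $kill_level_for{$label} : $default;
}

printf("%-9s => %.1f\n", $_, kill_level_from_preference($_))
  for qw(probable likely certain);
```

The end user only ever sees the labels, while everything internal
keeps working with plain numbers.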

There indeed are several config variables such as DSN- and quarantine-
cutoff levels, tag, tag2 and kill levels. Of all these, the end-user
is only concerned with kill level (normally tag2 level == kill level).
Both cutoff levels are purely in domain of site mail administrator,
and tag level is non-critical, it need not be settable by end-user.

Indeed other spam scanners use different scales and to compare them
they all need to be mapped to comparable scales. It is just not that
obvious to me that a few-attributes scale is better than a linear scale,
at least internally.

In my quick-hack code to support DSPAM I mapped DSPAM results to
score values by SA rules to contribute to the final score. One can
make the score as large as wanted - in my case it turned out to be
prudent to keep it fairly low :)
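
To illustrate the idea only (this adds the contribution directly
rather than going through SA rules as my quick hack does), here is
a small sketch. The verdict strings, the 0..1 confidence range, the
cap of 2 points and the subroutine name are all assumptions.

```perl
use strict;
use warnings;

# Assumed inputs: a DSPAM verdict ('Spam' or 'Innocent') and a
# confidence in the range 0..1.  $max_points caps the contribution
# so that DSPAM alone cannot push a message over the kill level.
sub dspam_score_contribution {
  my($verdict, $confidence, $max_points) = @_;
  $max_points = 2.0  if !defined($max_points);   # deliberately low
  $confidence = 0  if !defined($confidence) || $confidence < 0;
  $confidence = 1  if $confidence > 1;
  my $sign = $verdict eq 'Spam' ? 1 : $verdict eq 'Innocent' ? -1 : 0;
  $sign * $confidence * $max_points;
}

my $sa_score = 4.2;   # made-up SpamAssassin score
my $total = $sa_score + dspam_score_contribution('Spam', 0.9);
printf("combined score: %.2f\n", $total);
```

Any verdict other than the two expected ones contributes nothing,
which keeps an unknown DSPAM answer from influencing the result.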

> 3. Mail body modification: In order to use the full power of DSPAM,
>    DSPAM should be able to write its signature to the body of the
>    mail. This changes many implicit assumptions and imposes several
>    new problems (one of them is RAM usage).

Indeed. The $msginfo->mail_text does offer a partial solution, it can
be a file handle, or a MIME::Entity object (which is currently used
for notifications and for mail defanging). A drawback is that handling
MIME::Entity object is slower than handling mail on a file, and that
currently it is not possible to have per-recipient mail body changes.

For the purpose of having mail defanging available on a per-recipient
basis, some changes would be necessary. This is on a to-do list,
but it is a rather complex change.

> 4. Mail splitting: A mail that is addressed to multiple recipients
>    may have to be split into many if these recipients use different
>    spam profiles which would lead to different mail bodies when using
>    DSPAM.

This is related to the above. Mail splitting when the header needs to
differ is available now, but a way to represent per-recipient body
changes is missing.

> 5. Environment preparation for spam scanning: I like to map one recipient
>    to a certain DSPAM user (as the DSPAM groups are not flexible
>    enough for that). IMHO there should be a hook, how to plug in
>    easily modifiable "environment preparation" plugin. I see this
>    point related to 4.

I'm leaving this up to you.

> Completing this work would lay grounds for chaining multiple spam
> scanners with the only problem of combining the results left -
> although I am personally not convinced that using multiple spam
> scanners will do any good besides very careful people who want to avoid
> false positives by all means.

For some time it appeared to me that combining DSPAM as one of the
SA rules was beneficial (but not DSPAM alone). Now I'm not convinced
anymore that I would like to have two spam scanners. Certainly opinions
will differ on that point. I'd still prefer if SA folks would make
a plugin for DSPAM, but this is an unlikely event.

> WHAT I HAVE DONE:
> Now there is the plugin interface and the existing SpamAssassin code
> is used as a plugin. This should be seen as a first step in a larger
> refactoring to get fully working DSPAM plugin. Please note that my
> code had a very poor testing procedure.
> The patch is available at:
> http://www.felix-schwarz.name/files/opensource/04_pluginapi.patch

Thanks. Mostly I needed to separate initialization code into two
parts, one that needs to be done before chroot takes place (loading of
modules), and the rest, which (for security) should be done after
chroot and after changing UID takes place.
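
Roughly, the split looks like the sketch below. The package and
method names are invented for this example and do not match the
actual code; the chroot and UID change themselves are left out.

```perl
use strict;
use warnings;

# Hypothetical scanner plugin with two initialization phases.
package My::Scanner;
sub new { my $class = shift; bless { loaded => 0, ready => 0 }, $class }

# Phase 1: runs as the startup user, before chroot; pull in every
# module the scanner will need, since files may be unreachable later.
sub preload {
  my $self = shift;
  require File::Temp;        # stand-in for the scanner's own modules
  $self->{loaded} = 1;
}

# Phase 2: runs inside the chroot with the unprivileged UID;
# only work that needs no extra privileges or module files.
sub init_child {
  my $self = shift;
  die "preload must run first\n"  if !$self->{loaded};
  $self->{ready} = 1;
}

package main;

my @scanners = (My::Scanner->new);
$_->preload  for @scanners;
# ... here the daemon would chroot and change UID (omitted) ...
$_->init_child  for @scanners;
print "scanners ready: ", scalar(grep { $_->{ready} } @scanners), "\n";
```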

> SOME COMMENTS:
> As far as I know there is no other patch which has the features
> described above. If I did not search thoroughly enough please point it
> out to me. I would be happy to improve existing code. I just did not
> ask before writing the code because I want to put up before making
> big announcements which may never become reality.
>
> Please consider my modifications as a prototype work. Therefore I do
> not expect it to be integrated into main line amavisd-new now. On the
> other hand I do not want to fork amavisd-new when developing a real
> patch. The purpose of this mail is to publish my work, gather criticism
> and proposals for improvement and maybe get other interested
> developers to work at the code.

I wouldn't like to see a fork either.

> My patch certainly will not match the current coding standards of
> amavisd-new.
>
> One (micro) example is naming: I'm using camel case most of the time
> (Java notation) while I noticed that amavisd-new uses more underscores
> (Pascal/Un*x syntax).

I mostly kept this style from previous amavis code.
But I admit my programming background is indeed Pascal :)
I would use camel case for subroutine names and modules,
and all-lowercase with underscores for variables, if I started
from scratch.

> Additionally I use "my $var = shift();" while I 
> think current coding style is "my($var) = @_;"

My choice of using 'my' with parentheses is based on my first
inexperienced steps with Perl, where I would try to:

  my $a, $b, $c;
  my $x, $y=1;

only to find out that only the first variable after 'my' is declared
as a lexical; the remaining variables are not made local at all.

After that bitter experience, I made my rule to always place
arguments to 'my' in a list:

  my($a, $b, $c);
  my($x,$y) = (2,3);

which either does what is 'obvious', or points out a mistake.
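
A small self-contained check of the difference, as a sketch: it uses
$x and $y rather than $a and $b, because $a and $b are the special
sort variables and are exempt from 'use strict', which would hide
the problem.

```perl
use strict;
use warnings;

# Compile each declaration separately with eval so that the
# failure can be reported instead of aborting this script.
my $without_parens = q{ use strict; my $x, $y = 1; 1 };
my $with_parens    = q{ use strict; my($x, $y) = (2, 3); "$x $y" };

for my $code ($without_parens, $with_parens) {
  local $SIG{__WARN__} = sub {};   # hide the "Parentheses missing" hint
  my $result = eval $code;
  print(defined($result) ? "ok:     $code\n" : "failed: $code\n   $@");
}
```

Under strict, the first form fails to compile because $y was never
declared, while the second declares both variables.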

> and prefer to use "()" 
> if I call a method although Perl does not require it.

I do not mind a () after shift or other built-in functions,
but I consider it a waste of space in method calls such as $obj->new()

> I would adapt my code to the current coding standard
> but this way it was easier to write for me.

Sure, everyone has his preferences.
I don't mind some diversity in areas where I do not have strong
preferences, although I do prefer consistent style and sometimes
go to trouble of changing some aspect throughout the program
when my 'threshold of ugliness' is reached. I also like a compact
style, where functional units fit on a screenful. And I like
to make a condition in an if clause such that a shorter code
follows the 'then' and longer code section follows the 'else',
which does not break the reader's attention span.

> On other things I feel more strongly about: Currently big methods seem
> to be preferred (do_check contains more than 900 lines of code) while
> I think that no method should be over 7-10 lines of code (for Perl one
> needs mostly 10-15 lines because the syntax needs sometimes quite a
> bit of space to be readable).

You mean check_mail. This is indeed an extreme case and I don't like
it either. It just kept growing from an already large initial code,
and I never got around to splitting it. As for the rest, I think
the 10-15 is fine for utility routines, although I prefer somewhat
larger chunks when dealing with a single functional unit.

> Instead of using the implicit return of
> the last used value, I prefer writing "return 1;" instead of "1;"
> for example.

Here I disagree. My main grudge against a 'return' is that it makes
people use it from the middle of a routine. Although I do use this
myself on a rare occasion, I much prefer that only the last
statement is a return. In that sense, this makes 'return'
redundant, it just wastes space, specially in tiny functions.
If something like a Pascal-style assignment to a function name
would be possible (without causing a return), that would be
the best compromise in my view - explicit labeling of function
result, without changing program flow.
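
The closest thing available in Perl is a named result variable that
is the last expression of the routine. A sketch of the two styles
side by side, with made-up routine names and values:

```perl
use strict;
use warnings;

# Early-exit style: the result is produced in three different places.
sub classify_early {
  my($score, $kill_level) = @_;
  return 'unknown'  if !defined($score);
  return 'spam'  if $score >= $kill_level;
  return 'ham';
}

# Single-exit style: one named result variable, one final expression.
sub classify_single {
  my($score, $kill_level) = @_;
  my $class;
  if    (!defined($score))      { $class = 'unknown' }
  elsif ($score >= $kill_level) { $class = 'spam' }
  else                          { $class = 'ham' }
  $class;
}

print classify_single(7.5, 6.9), "\n";
```

Both routines return the same answers; the second just labels its
result without redirecting program flow.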

> Furthermore all the original code is contained in one file 
> but I don't think this makes sense when introducing a plugin
> interface. The plugins should be developed as independently as
> possible and open for easy in-house customization.

It is indeed incompatible with plugin interface, something
needs to be thought out.

I currently prefer keeping all modules in one file for
two reasons: it makes installation easier, just two files
to worry about; and it makes my editing habits easy,
not having to jump from file to file when changing more
global aspects of a program.

I would not mind having an automatic procedure to split
the file into modules and splice them back to a single file,
if that would make other contributors happy.
Just let me have my single file :)

> ROADMAP:
> 1. Reworking the caching code: A cache is only useful for heuristic
> scanners. The results of statistical filters depend on the complete
> mail history. Caching should be done in spam scanning plugins. Move
> the current cache code to the SpamAssassin plugin.

Caching would benefit when separated into its own module.
But it is not true that only SA benefits from it - both
viruses and clean messages benefit from cache.
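
As a sketch of what a separate caching module could boil down to,
assuming results are keyed by a digest of the message body: the
scan routine here is only a stand-in, and none of these names come
from the actual code.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %verdict_cache;      # body digest => stored scan result
my $scans_run = 0;

# Stand-in for an expensive virus or spam check (hypothetical).
sub expensive_scan {
  my($body) = @_;
  $scans_run++;
  $body =~ /EICAR/ ? 'infected' : 'clean';
}

# Return the cached verdict for an identical body, scanning only once.
sub cached_scan {
  my($body) = @_;
  my $key = md5_hex($body);
  $verdict_cache{$key} = expensive_scan($body)
    if !exists($verdict_cache{$key});
  $verdict_cache{$key};
}

cached_scan("hello world\n")  for 1..3;
print "scans actually run: $scans_run\n";
```

A clean verdict is stored just like an infected one, so repeated
copies of a harmless message skip the scan as well.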

> 2. Definition of spaminess (see above)
> 3. Mail body modification (see above)
> 4. DSPAM plugin
> 5. Mail splitting (see above)
> 6. Environment preparation for spam scanning (see above)
>
>
> QUESTION:
> How many different levels of "spam probability" do make sense?
> SpamAssassin just returns a float value on a linear scale so one can
> define arbitrary many levels. Of course one could do this with DSPAM
> too (by combining confidence and spam probability into a single
> value). Currently these different levels are used for identifying
> spam/ham, tag level, tag2 level, dsn cutoff, quarantine cutoff and
> kill level. I wonder if many people actually find these many levels
> useful... How could one define different categories on a solid basis
> so that they are statistically significant?
> I tend to think only of three categories of mail: ham, probably spam
> and definitely spam. Maybe there is an additional category named
> "uncertain" but these should go into ham I think.
> IMHO one needs to answer this question in order to define the plugin
> API because I don't think every scanner should adapt to SpamAssassin.
> I would like to see something like "spam", "definitely spam" etc. as
> return values which abstract more from actual SpamAssassin score
> values.
>
> The five different levels (tag, tag2, ...) suggest to divide the spam
> into five different categories. If one uses only two or three spam 
> categories, I think the thresholds should be specified in another way
> such as:
>  spam: tag
>  definitely spam: tag, tag2, quarantine cutoff, dsn cutoff
> Comments anyone?

Don't know. My opinion on that issue is above.

> Btw: Is there a plan for further development of amavisd-new?
> In which directions should amavisd-new be developed?

My tactical list is on:
  http://www.ijs.si/software/amavisd/TODO.txt

As for long term goals I have no special roadmap,
I'm addressing needs by their priority: the ones with good
effect-to-work ratio come first.

  Mark

