Re: [AMaViS-user] Recipient-domain-specific SA bayes db [was 'no subject']

Mark Martinec Thu, 16 Sep 2010 09:29:27 -0700

Yassen,

> Let me give a short background of my problem: I host email for half a dozen
> domains and amavisd-new does a great job filtering the mail using clamav,
> SA, pyzor, razor and bayes (via SA). Bayes is a VERY helpful addition to
> the other tests and greatly improves the spam filtering success.
> 
> What I noticed was that within a domain bayes works great, probably because
> legitimate mail within a domain tends to have a lot in common (also, spam
> tends to have things in common). The opposite is true if I compare
> different domains with each other -- users of different domains use
> different languages, not to mention other differences (I have
> English-speaking, German and Bulgarian domains). This is the reason that I
> seek a solution to separate the bayes database so that it works "per domain"
> instead of being a global one for the whole install. I guess the perfect
> solution would be to maintain a separate bayes db for each user, but the
> very good results for installations with a single db for a whole domain
> make me believe that this is a good approach that will be a lot simpler
> and yet retain good quality.


Agreed, for such a diverse population of local domains (different languages)
it certainly helps to dedicate one bayes sub-database for each local domain.

I guess in your case letting each user have his own bayes db would yield
worse results than grouping by domain. Some grouping of similar users
is desirable: spam learning works faster and users benefit from each other.

Btw, SpamAssassin 3.4.0 (from SVN) has an interesting and useful
feature: bayes_auto_learn_on_error 1
It causes autolearning only when useful:

$ man Mail::SpamAssassin::Plugin::AutoLearnThreshold

bayes_auto_learn_on_error (0 | 1)        (default: 0)

  With "bayes_auto_learn_on_error" off, autolearning will be
  performed even if bayes classifier already agrees with the new
  classification (i.e.  yielded BAYES_00 for what we are now trying
  to teach it as ham, or yielded BAYES_99 for spam). This is a
  traditional setting, the default was chosen to retain backwards
  compatibility.

  With "bayes_auto_learn_on_error" turned on, autolearning will be
  performed only when a bayes classifier had a different opinion from
  what the autolearner is now trying to teach it (i.e. it made an
  error in judgement). This strategy may or may not produce better
  future classifications, but usually works very well, while also
  preventing unnecessary overlearning and slowing down database growth.
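
For reference, a minimal sketch of how this could be turned on in a
SpamAssassin configuration file such as local.cf. It assumes
SpamAssassin 3.4.0 or later and that the AutoLearnThreshold plugin is
loaded; the file name is only an example:

```
# Keep autolearning enabled (this is already the default)
bayes_auto_learn           1

# Autolearn only when the bayes classifier disagreed with the
# classification the autolearner is about to teach it
bayes_auto_learn_on_error  1
```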


> >> My current plan is to introduce $sa_bayes_path in amavisd-new config
> >> file(s), have amavisd-new patched to honor that argument when calling
> >> SA, and also have it listen on a separate port for each domain. I will
> >> then use policy banks to tune that same $sa_bayes_path argument
> >> differently for each of the different ports (=domains).
> 
> This didn't work for me; I guess because amavisd-new passes parameters to SA
> only when instantiating it, that is, at startup time.

True. That would only work by switching config files, as available
since 2.7.0-pre7 through its @sa_userconf_maps. But switching configurations
is rather slow, so I'd not recommend it unless it happens infrequently.


> > As it happens, the switching of SpamAssassin configurations between
> > messages (or even within a processing of a single mail message with
> > multiple recipients) is a rather costly operation. For the purpose
> > of switching a username used for Bayes SQL lookups it suffices to
> > tell SpamAssassin to switch a username without loading his preferences
> > config file. Such username switching is a fairly inexpensive operation.
> 
> So I should consider using an SQL-based bayes database, correct?

Correct. Switching a username for the purpose of bayes_vars.username
and bayes_token.id costs only about 12 ms on our mailer. This is the
preferred way to implement per-domain or per-recipient bayes and AWL
databases.

I will not be opening a can of worms by supporting real switching
of user/uid and accessing home directories of real Unix users.
This would require amavisd to run as root or have access to
such user files and their personal bayes databases. Too expensive,
too risky security-wise, and unnecessary, as we have a simpler,
safer and faster alternative through SQL.
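
As a rough sketch, an SQL-based bayes setup in a SpamAssassin
configuration file might look like the fragment below. The MySQL
backend, the database name "sa_bayes", and the credentials are
placeholders chosen for illustration, not values from this thread;
the bayes tables are assumed to have been created from the schema
shipped with SpamAssassin:

```
# Store bayes data in SQL instead of per-user DB files
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL

# Placeholder DSN and credentials, adjust to your installation
bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  changeme

# Do not set bayes_sql_override_username here: it would pin all
# lookups to one username and defeat per-domain switching
```

With something like this in place, each virtual username ends up as
its own row in bayes_vars, and its tokens are kept apart by
bayes_token.id.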


> > This is a fairly straightforward change from the current 2.7.0-pre7,
> > based on all the already laid-down supporting mechanisms, and I guess
> > I can make it into 2.7.0-pre8 without too much trouble, if someone
> > is interested.
> 
> I am one (obviously); anyone else voting here?

It will be in -pre8, maybe tomorrow. I already have it running here,
looks good. Here is how a recipient address can be mapped to a
virtual username:

@sa_username_maps = ({
  '.example.com' => 'virt-user1',
  '.example.net' => 'example.net',
  'mys...@here.example.com' => 'myself',
  '.' => 'vscan',
});

SpamAssassin username switching is done through a call to
$spamassassin_obj->signal_user_changed(...). The rest
is all up to its Bayes plugin. Just keep in mind that
bayes_sql_override_username must *not* be set when switching
the bayes username is desired.

  Mark

