Suggesting bayes_atime_update configuration option

Vesa-Matti J Kari Fri, 09 Jul 2010 05:06:42 -0700

Hello,

I would like to suggest a new configuration option called
bayes_atime_update. First, let me briefly describe the spam detection
setup I have in mind, so that the rationale is clear.



There are four spam-detecting nodes. For brevity, let's call them A, B, C
and D. The spam detection work is divided evenly among the nodes
(round-robin).

A, B, C and D each run MIMEDefang-milter. SpamAssassin is called via
MIMEDefang, and the Bayes functionality is turned on. The automatic
expiration of tokens is turned off.

The tokens and data related to them are stored in MySQL databases, with
each node running and using its own MySQL server.
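
For reference, the per-node SpamAssassin settings for a setup like this
might look roughly as follows. This is only a sketch: the choice of storage
module, the DSN, the database name and the credentials are placeholders
assumed for illustration, not the exact values in use.

```
# local.cf (sketch; all values below are placeholders)
use_bayes           1
bayes_auto_expire   0

# Store the Bayes tokens in the node's own local MySQL server.
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:spamassassin:localhost
bayes_sql_username  sa_user
bayes_sql_password  sa_password
```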


  A = the MySQL replication master node
  B, C, D = MySQL replication slave nodes


The idea is that whenever the Bayesian filter is trained on new spam/ham
messages, the learning is done on node A. The MySQL servers are set up so
that all changes on node A are propagated to the slaves B, C and D.


However, this setup is currently a little messy, because instead of
relying exclusively on the data arriving from the master, the slaves keep
on updating their own databases.

To be more precise, B, C and D update the atime field in their bayes_token
tables, and also newest_token_age in their bayes_vars tables. Because the
atime values can then differ among the nodes, expiring tokens can leave
the nodes with different sets of tokens. There are also lots of
unnecessary UPDATEs on the slaves.
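
The divergence can be illustrated with a small standalone Perl sketch. It
does not use SpamAssassin at all: the token names, the timestamps, the
cutoff and the expire() helper are all made up for illustration, and
expire() is only a stand-in for atime-based expiry, not the real expiry
code.

```perl
#!/usr/bin/perl
# Hypothetical illustration only: token names, timestamps and the cutoff
# are invented, and expire() is a stand-in for atime-based expiry.
use strict;
use warnings;

# Both nodes start from the same replicated token => atime data.
my %master = (viagra => 100, meeting => 100, invoice => 300);
my %slave  = %master;

# The slave scans a message at time 400 and touches a token locally.
# This write never reaches the master, because replication is one-way.
$slave{meeting} = 400;

# Delete every token whose atime is older than the cutoff and
# return the names of the surviving tokens.
sub expire {
    my ($tokens, $cutoff) = @_;
    delete @{$tokens}{ grep { $tokens->{$_} < $cutoff } keys %$tokens };
    return join ',', sort keys %$tokens;
}

# The same expiry rule, applied to each node's own atimes.
my $on_master = expire(\%master, 200);
my $on_slave  = expire(\%slave,  200);

print "master keeps: $on_master\n";   # invoice
print "slave keeps:  $on_slave\n";    # invoice,meeting
```

With identical starting data and an identical cutoff, the two nodes end up
with different token sets purely because of the slave's local atime update.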


In order to keep the atimes consistent across all nodes, and to reduce
unneeded work on the slaves, I have added a new boolean configuration
option to SpamAssassin. It is called "bayes_atime_update" and it defaults
to 1, so that no existing installation should break.


My plan is to enable bayes_atime_update on the master node A and disable
it on the slaves B, C and D. That way the slaves would stay in sync with
the master and would not have to update the atime fields themselves.
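
In configuration terms, the plan would look like this. Note that
bayes_atime_update exists only with the patch below applied; the file name
local.cf is an assumption about where the settings live.

```
# local.cf on the master node A (same as the proposed default)
bayes_atime_update 1

# local.cf on the slave nodes B, C and D
bayes_atime_update 0
```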


Does this make sense? And would it be possible to apply something like the
following patch to the SpamAssassin main line? Thanks for the information.



diff -urp SpamAssassin.old/Conf.pm SpamAssassin/Conf.pm
--- SpamAssassin.old/Conf.pm    2010-07-09 13:32:12.000000000 +0300
+++ SpamAssassin/Conf.pm        2010-07-09 13:33:44.000000000 +0300
@@ -1421,6 +1421,18 @@ for details on how Bayes auto-learning i
     type => $CONF_TYPE_BOOL,
   });

+=item bayes_atime_update ( 0 | 1 )      (default: 1)
+
+Whether SpamAssassin should update the tokens' atime.
+
+=cut
+
+  push (@cmds, {
+    setting => 'bayes_atime_update',
+    default => 1,
+    type => $CONF_TYPE_BOOL,
+  });
+
 =item bayes_ignore_header header_name

 If you receive mail filtered by upstream mail systems, like
diff -urp SpamAssassin.old/Plugin/Bayes.pm SpamAssassin/Plugin/Bayes.pm
--- SpamAssassin.old/Plugin/Bayes.pm    2010-07-09 13:36:08.000000000 +0300
+++ SpamAssassin/Plugin/Bayes.pm        2010-07-09 13:39:15.000000000 +0300
@@ -769,7 +769,9 @@ sub scan {
   # no need to call tok_touch_all unless there were significant
   # tokens and a score was returned
   # we don't really care about the return value here
-  $self->{store}->tok_touch_all(\@touch_tokens, $msgatime);
+  if ($self->{conf}->{bayes_atime_update}) {
+         $self->{store}->tok_touch_all(\@touch_tokens, $msgatime);
+  }

   $permsgstatus->{bayes_nspam} = $ns;
   $permsgstatus->{bayes_nham} = $nn;



Regards,
vmk
-- 
************************************************************************
               Tietotekniikkaosasto / Helsingin yliopisto
                 IT department / University of Helsinki
************************************************************************
