> > So, we wouldn't be able to unlearn historical emails (processed
> > before the upgrade),

Yes.

> > but there'd be few other side-effects?

Right, practically no other ill effects. The token format remains unchanged.

> yes old learned mails cant be unlearned, even bayes_tokens is equal,
> it cant track if it was spam or ham learned

The spamminess is a property of a token, so you still know if a
token is spammy or hammy even if there is no more trace of
the original message in the 'seen' set.

It is not uncommon to ditch the 'seen' table (if using SQL)
every now and then when it grows large. Changing the
canonicalization algorithm used to compute the message digest
stored in the 'seen' set is no different from ditching the 'seen'
table; it just wastes some storage until the old entries expire.
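
To illustrate the point, here is a toy sketch in Python. It is not
SpamAssassin's actual code: the class, the two canonicalization
functions and the use of SHA-1 are all assumptions made up for the
example. It only shows that once the digest computation changes, an
old message can no longer be found in 'seen' (so it cannot be
unlearned), while the token counts stay exactly as they were.

```python
import hashlib


class ToyBayesStore:
    """Toy model: per-token counts plus a 'seen' map of digest -> label."""

    def __init__(self, canonicalize):
        self.canonicalize = canonicalize  # hypothetical canonicalization function
        self.tokens = {}  # token -> [spam_count, ham_count]
        self.seen = {}    # message digest -> "spam" or "ham"

    def digest(self, message):
        canonical = self.canonicalize(message)
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def learn(self, message, label):
        d = self.digest(message)
        if d in self.seen:
            return False  # already learned, nothing to do
        idx = 0 if label == "spam" else 1
        for tok in set(message.lower().split()):
            self.tokens.setdefault(tok, [0, 0])[idx] += 1
        self.seen[d] = label
        return True

    def forget(self, message):
        # Unlearning needs the 'seen' entry to know how the message was learned.
        label = self.seen.pop(self.digest(message), None)
        if label is None:
            return False
        idx = 0 if label == "spam" else 1
        for tok in set(message.lower().split()):
            counts = self.tokens.get(tok)
            if counts and counts[idx] > 0:
                counts[idx] -= 1
        return True


def old_canon(msg):
    # Made-up "before the upgrade" canonicalization.
    return msg.strip()


def new_canon(msg):
    # Made-up "after the upgrade" canonicalization.
    return " ".join(msg.split()).lower()


store = ToyBayesStore(old_canon)
store.learn("Cheap pills  online", "spam")
tokens_before = {t: list(c) for t, c in store.tokens.items()}

store.canonicalize = new_canon  # the "upgrade"
forgot_old = store.forget("Cheap pills  online")  # digest no longer matches
tokens_after = {t: list(c) for t, c in store.tokens.items()}

learned_new = store.learn("Buy now", "spam")  # learned after the upgrade
forgot_new = store.forget("Buy now")          # can still be unlearned
```

In this sketch the old message cannot be forgotten after the switch,
the token counts are untouched, and the stale 'seen' entry simply
sits there until it is removed or expires.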

In my view there is hardly any reason to worry about a change in
the canonicalization algorithm for computing the message digest.
The effect is the same as setting a very short expiration time on
the 'seen' entries (this is now configurable in the Redis backend).
The unchanged token set ensures that Bayes classification remains
as effective as before.

  Mark
