[Bug 8094] Non balanced bayes ratio in db makes the accuracy plummet

bugzilla-daemon Fri, 23 Dec 2022 10:56:32 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8094


Bill Cole <billc...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
         Resolution|---                         |WORKSFORME
           Priority|P2                          |P4
                 CC|                            |billc...@apache.org
             Status|NEW                         |RESOLVED

--- Comment #1 from Bill Cole <billc...@apache.org> ---
No documented explicit assumption exists in the code regarding the ratio of ham
to spam in the Bayes training corpus. However, I don't believe there has been
significant attention to the details of the Bayes implementation in many years,
so it is possible that some assumption is implicit in the code and no one has
noticed.

I don't believe we have any data that could confirm or refute a relationship
between Bayes accuracy and the ham/spam training ratio. Anecdotally, I just
checked 3 systems I work with which do not have discernible Bayes errors and
none of them has more than 5% spam in the training DB. 
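
For anyone who wants to run the same kind of check, here is a minimal Python
sketch that computes the spam share of a training DB from the output of
`sa-learn --dump magic`. It assumes the usual layout of that output, where each
line ends in "non-token data: <label>" and the third column holds the value;
verify that against your own installation. The sample numbers are invented for
illustration and are not taken from any real system.

```python
def parse_dump_magic(text):
    """Return {label: value} parsed from `sa-learn --dump magic` style output.

    Assumption: each relevant line looks like
        0.000   0   <value>   0  non-token data: <label>
    """
    marker = "non-token data:"
    counts = {}
    for line in text.splitlines():
        if marker not in line:
            continue
        fields, label = line.split(marker, 1)
        cols = fields.split()
        if len(cols) < 3:
            continue  # not the layout we expect; skip rather than guess
        counts[label.strip()] = int(cols[2])
    return counts


def spam_fraction(counts):
    """Share of learned messages that were spam, or None if nothing is learned."""
    total = counts["nspam"] + counts["nham"]
    return counts["nspam"] / total if total else None


# Invented sample output, for illustration only.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        412          0  non-token data: nspam
0.000          0       9588          0  non-token data: nham
0.000          0     152340          0  non-token data: ntokens
"""

counts = parse_dump_magic(SAMPLE)
print(f"spam share of training corpus: {spam_fraction(counts):.1%}")
```

In practice the text would come from running `sa-learn --dump magic` as the
user that owns the Bayes DB, rather than from a hard-coded sample.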

One known source of Bayes inaccuracy is failure to expire the Bayes DB
regularly. Over time, the character of spam evolves and as a result the scores
of older tokens are increasingly obsolete. If your Bayes DB is dominated by
tokens more than about 2 weeks old, it will not be very accurate. If you use
MySQL for the Bayes DB, you may find it necessary to forcibly expire the DB,
particularly on an active server. It is also possible to damage the accuracy of
the Bayes DB through improper training, especially by using SA's 'autolearn'
feature or by learning user-identified ham/spam without robust oversight. 
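
To get a rough idea of how stale a DB is, the sketch below estimates the share
of tokens whose last-access time is more than two weeks old. It assumes the
token lines of `sa-learn --dump data` carry five whitespace-separated columns
(probability, spam count, ham count, atime as a Unix timestamp, token); check
that against your own output before relying on it. The sample lines and the
`NOW` timestamp are invented. Forcing an expiry run itself is done with
`sa-learn --force-expire`, not with this script.

```python
TWO_WEEKS = 14 * 24 * 60 * 60  # seconds


def stale_token_share(dump_text, now, max_age=TWO_WEEKS):
    """Fraction of tokens whose atime is older than max_age seconds.

    Assumption: token lines look like
        <prob> <spam_count> <ham_count> <atime> <token>
    Lines that do not fit that shape (including "non-token data" lines)
    are ignored.
    """
    total = stale = 0
    for line in dump_text.splitlines():
        cols = line.split()
        if len(cols) < 5 or "non-token" in line:
            continue
        try:
            atime = int(cols[3])
        except ValueError:
            continue
        total += 1
        if now - atime > max_age:
            stale += 1
    return stale / total if total else None


# Invented sample: two tokens touched this week, two untouched for over a month.
NOW = 1_671_800_000
DAY = 86_400
SAMPLE = f"""\
0.987         12          0 {NOW - 1 * DAY}  a1b2c3d4e5
0.012          0         31 {NOW - 3 * DAY}  0f9e8d7c6b
0.958          7          0 {NOW - 40 * DAY}  1122334455
0.500          1          1 {NOW - 90 * DAY}  66778899aa
"""

print(stale_token_share(SAMPLE, NOW))  # 0.5
```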

We are always open to improved implementations of our existing tactics, such as
Bayesian analysis, and the plugin architecture facilitates creating
alternatives. I don't believe that there is anyone currently working on an
alternative Bayes implementation, and the place to ask a broader audience about
that would be our Developers' mailing list, which is open to the public. I
would not expect anyone to take on such a task without a well-defined
reproducible (or at least broadly recognized) problem. It also may be helpful
to raise this issue with the broader SA community by discussing it on the
SpamAssassin Users mailing list, if only to solidify whether others see the
same problem. 

Because it is so hard to attribute Bayes problems to actual bugs in code rather
than to mistraining, the standard response to chronic misfires of the BAYES_*
rules is to wipe the DB and retrain it with recent hand-classified ham and
spam. It is generally not possible to identify the messages one would need to
forget in order to undo the complex damage that mistraining can cause.
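
A wipe-and-retrain cycle can be sketched as follows. This is only an outline
under stated assumptions: the two directory paths are placeholders, the corpus
directories are assumed to hold one hand-classified message per file, and the
script defaults to a dry run that prints the commands instead of executing
them. Note that `sa-learn --clear` discards the existing Bayes DB, so back it
up first if there is any chance you will want it again.

```python
import subprocess


def retrain_commands(ham_dir, spam_dir):
    """Build the sa-learn invocations for a wipe-and-retrain cycle."""
    return [
        ["sa-learn", "--clear"],           # wipe the existing Bayes DB
        ["sa-learn", "--ham", ham_dir],    # learn recent hand-classified ham
        ["sa-learn", "--spam", spam_dir],  # learn recent hand-classified spam
    ]


def retrain(ham_dir, spam_dir, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in retrain_commands(ham_dir, spam_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure


# Placeholder paths; substitute your own corpus directories.
retrain("/path/to/recent-ham", "/path/to/recent-spam")
```

Site-wide or SQL-backed setups may need additional options (for example, to
select the user or configuration), which are deliberately left out here.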

I am resolving this bug as "works for me" because it does not identify a
reproducible error and we are not in a position to replace/refactor the Bayes
implementation without a concrete definition of what needs fixing and what
could constitute a fix.

-- 
You are receiving this mail because:
You are the assignee for the bug.
