https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8094
Bill Cole <billc...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
         Resolution|---                         |WORKSFORME
           Priority|P2                          |P4
                 CC|                            |billc...@apache.org
             Status|NEW                         |RESOLVED

--- Comment #1 from Bill Cole <billc...@apache.org> ---
No documented explicit assumption exists in the code regarding the ratio of ham to spam in the Bayes training corpus. I don't believe there has been significant attention to the details of the Bayes implementation in many years, however, so it is possible that some assumption is implied in the code and no one has noticed.

I don't believe we have any data that could confirm or refute a relationship between Bayes accuracy and the ham/spam training ratio. Anecdotally, I just checked 3 systems I work with which do not have discernible Bayes errors, and none of them has more than 5% spam in the training DB.

One known source of Bayes inaccuracy is failure to expire the Bayes DB regularly. Over time, the character of spam evolves, and as a result the scores of older tokens become increasingly obsolete. If your Bayes DB is dominated by tokens more than about 2 weeks old, it will not be very accurate. If you use MySQL for the Bayes DB, you may find it necessary to forcibly expire the DB, particularly on an active server.

It is also possible to damage the accuracy of the Bayes DB by improper training, especially by use of the 'autolearn' feature of SA or by learning user-identified ham/spam without robust oversight.

We are always open to improved implementations of our existing tactics such as Bayesian analysis, and the plugin architecture facilitates creating alternatives. I don't believe that there is anyone currently working on an alternative Bayes implementation, and the place to ask a broader audience about that would be our Developers' mailing list, which is open to the public.
I would not expect anyone to take on such a task without a well-defined, reproducible (or at least broadly recognized) problem. It may also be helpful to raise this issue with the broader SA community by discussing it on the SpamAssassin Users mailing list, if only to establish whether others see the same problem.

Because it is so hard to pin Bayes problems on actual bugs in code rather than on mis-training, the standard response to chronic misfires of the BAYES_* rules is to wipe and retrain the DB with recent hand-classified ham and spam. It is generally not possible to identify the messages one would need to forget in order to undo the complex damage that mislearning can cause.

I am resolving this bug as "works for me" because it does not identify a reproducible error, and we are not in a position to replace or refactor the Bayes implementation without a concrete definition of what needs fixing and what could constitute a fix.
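
For anyone who wants to script the wipe-and-retrain step described above, here is a minimal Python sketch. It only assembles the sa-learn command lines and, by default, prints them without running anything. The directory names, the function names, and the assumption that each corpus is a directory holding one message per file are mine, not part of SpamAssassin; an mbox corpus would need sa-learn's --mbox option as well.

```python
import subprocess


def build_retrain_commands(ham_dir, spam_dir, sa_learn="sa-learn"):
    """Return the sa-learn invocations for a wipe-and-retrain, in order.

    ham_dir and spam_dir are hypothetical directories of recent,
    hand-classified mail, one message per file.
    """
    return [
        [sa_learn, "--clear"],                 # wipe the existing Bayes DB
        [sa_learn, "--ham", str(ham_dir)],     # relearn known-good mail
        [sa_learn, "--spam", str(spam_dir)],   # relearn known-bad mail
        [sa_learn, "--dump", "magic"],         # show DB counters for review
    ]


def retrain(ham_dir, spam_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = build_retrain_commands(ham_dir, spam_dir)
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(argv, check=True)
    return commands
```

Calling retrain("corpus/ham", "corpus/spam") prints the four commands for review; pass dry_run=False only after confirming that the corpora are recent and correctly classified, since --clear discards the existing DB. If the DB is merely stale rather than mistrained, running sa-learn --force-expire on its own is a less drastic first step.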