[Bug 5497] Bayes has become unusable

bugzilla-daemon Thu, 07 Jun 2007 10:17:39 -0700

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5497






------- Additional Comments From [EMAIL PROTECTED]  2007-06-07 10:03 -------
(In reply to comment #32)
> My interpretation of the comments in bug 5257 is that people were reporting
> problems with a threshold of 0.1 because of too many low scoring spam being
> incorrectly learned so the ham threshold was lowered to -1.0.

It is not clear whether the problem is that spam is incorrectly learned as ham,
or that there are simply too many learning operations on a heavily loaded system
and it would be desirable to cut them down a bit.  If the latter is the actual
problem, I suggest using the method from comment #29 rather than changing the
threshold.

The problem is that in our case apparently no messages at all score below the
auto_learn threshold of -1.0.  Even at 0.1 many ham messages are not handed to
auto_learn, because it is so easy to get above 0.1 when AWL and BAYES_00 are not
counted.  Rules like RDNS_NONE fire half of the time here, and HTML_MESSAGE
almost always.  This means that even a slight change in the scoring can easily
disable autolearn=ham, which probably explains the change in behavior after
installing the update to 3.2.0.
I will try the mentioned patch and see what a more reasonable value for the
threshold is in our case.
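
To make the arithmetic concrete, here is a minimal sketch of the decision
described above.  It is a simplification, not SpamAssassin's actual code: the
rule scores are made-up values chosen only for illustration, and the function
names are hypothetical.  The only things taken from this thread are that
BAYES_xx and AWL are left out of the auto-learn score and that a message must
score below the ham threshold to be learned as ham.

```python
# Hypothetical rule scores for one ham message; real scores differ per release.
HITS = {"BAYES_00": -2.6, "AWL": -0.5, "HTML_MESSAGE": 0.5, "RDNS_NONE": 0.1}

# Rules left out of the auto-learn score, as described in this thread.
EXCLUDED_PREFIXES = ("BAYES_", "AWL")


def autolearn_score(hits):
    """Sum the scores of all rule hits except the excluded ones."""
    return sum(
        score
        for name, score in hits.items()
        if not name.startswith(EXCLUDED_PREFIXES)
    )


def would_learn_as_ham(hits, ham_threshold):
    """A message is auto-learned as ham only if it scores below the threshold."""
    return autolearn_score(hits) < ham_threshold


if __name__ == "__main__":
    print("full score:      ", round(sum(HITS.values()), 3))
    print("auto-learn score:", round(autolearn_score(HITS), 3))
    for threshold in (0.1, -1.0):
        print("threshold", threshold, "->", would_learn_as_ham(HITS, threshold))
```

With these assumed scores the message is clearly ham overall, yet the two small
positive rules alone keep it from being learned at either threshold.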

I can understand that you want to avoid feed-forward lockups by excluding the
score of BAYES_xx from the calculation, and to a lesser extent I can understand
the exclusion of AWL, but taken together this makes auto_learn quite fragile.

Something else that affects our Bayes DB is that we are a locally operating
company where more than 99% of all mail is in Dutch.  So the Bayes engine has
learned over time that Dutch=HAM and English=SPAM.  This normally works well,
but when someone sends a message via a freemail provider that appends an English
advertisement to each mail, and the message contains only an attachment with
little body text, it is scored at BAYES_80 or higher and lifted over our spam
threshold by simple things like an omitted subject.
Those messages are never learned as ham either, because those freemail providers
invariably score points in the "ignorance" and "HTML" categories.  So our Bayes
DB never learns that "Choose the right car based on your needs.  Check out
Yahoo! Autos new Car Finder tool." does not really mean the message is SPAM.
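
The effect of a long English footer on a short Dutch message can be sketched as
follows.  This uses plain naive-Bayes log-odds combining as a stand-in, which
is not necessarily how SpamAssassin combines token probabilities, and every
per-token probability below is a made-up value for illustration only.

```python
import math

# Hypothetical per-token spam probabilities in a database trained on mostly
# Dutch ham and English spam.
TOKEN_SPAM_PROB = {
    "zie": 0.1, "bijlage": 0.1,              # Dutch body tokens
    "choose": 0.8, "right": 0.8, "car": 0.8,  # English footer tokens
    "based": 0.8, "needs": 0.8, "finder": 0.8,
}


def combined_spam_probability(tokens):
    """Combine token probabilities by summing their log-odds (naive Bayes)."""
    log_odds = sum(
        math.log(TOKEN_SPAM_PROB[t] / (1.0 - TOKEN_SPAM_PROB[t]))
        for t in tokens
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


body = ["zie", "bijlage"]
footer = ["choose", "right", "car", "based", "needs", "finder"]

if __name__ == "__main__":
    print("body only:    ", combined_spam_probability(body))
    print("body + footer:", combined_spam_probability(body + footer))
```

Under these assumptions the two Dutch tokens alone look like ham, but the six
footer tokens outweigh them and push the combined result well past 0.8.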



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
