http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4787
------- Additional Comments From [EMAIL PROTECTED] 2006-03-15 15:58 -------

I just had this happen again on another box running autolearn thresholds of 0.1 and 12.0. I believe the problem is that as ham tokens get expired and spam tokens are the only thing left in bayes_token, BAYES_99 starts hitting on all email.

The minimum of 200 spam and 200 ham messages is not a good thing to go by (except on a brand-new install). I don't see how spam_count and ham_count in the bayes_vars table can accurately represent your token distribution, so having a minimum spam and a minimum ham token count would be optimal. As you can see, bayes_vars says it has learned 93k ham, but out of that, there are currently only 137 ham tokens in the table.

mysql> select spam_count,ham_count,token_count from bayes_vars;
+------------+-----------+-------------+
| spam_count | ham_count | token_count |
+------------+-----------+-------------+
|    2463944 |     93579 |     3620764 |
+------------+-----------+-------------+

mysql> select sum(spam_count), sum(ham_count) from bayes_token;
+-----------------+----------------+
| sum(spam_count) | sum(ham_count) |
+-----------------+----------------+
|        34846107 |            137 |
+-----------------+----------------+
1 row in set (0.00 sec)

I propose the following additions to the bayes config options:

bayes_min_ham_tokens <num>
bayes_min_spam_tokens <num>

and maybe something extra on top of that, like a ham-to-spam token ratio:

bayes_ham_spam_token_ratio 0.5   # require 1 ham for every 2 spam tokens

Now, I realize that running sum(spam_count) and sum(ham_count) on the bayes_token table is a more expensive query, so I think the bayes_vars table could add a couple of columns after `token_count`, say `spam_token_count` and `ham_token_count`, and auto-expiry/sa-learn could update those fields when it runs. That way the query to pull those counts remains efficient. Also, if those two fields are available, calculating the ham:spam ratio is a piece of cake.

If this sounds like an acceptable solution, I may work on it, unless anyone has reasons why it's not better to do it this way?

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
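
To make the proposal above concrete, here is a rough sketch of the check it describes, written in Python for illustration rather than in SpamAssassin's own Perl. The function name `bayes_usable` and the default values are hypothetical. The three parameters mirror the proposed bayes_min_spam_tokens, bayes_min_ham_tokens and bayes_ham_spam_token_ratio options, which are only suggestions in this comment. It assumes the two totals have already been read, either from the proposed spam_token_count and ham_token_count columns or from the sum() query on bayes_token.

```python
def bayes_usable(spam_tokens, ham_tokens,
                 min_spam_tokens=200, min_ham_tokens=200,
                 ham_spam_ratio=0.5):
    """Return True if the token totals look balanced enough to trust Bayes.

    All names and defaults here are hypothetical; they only illustrate
    the checks proposed in this comment.
    """
    # Proposed bayes_min_spam_tokens / bayes_min_ham_tokens:
    # refuse to use Bayes when either side has too few tokens left.
    if spam_tokens < min_spam_tokens or ham_tokens < min_ham_tokens:
        return False
    # Proposed bayes_ham_spam_token_ratio: 0.5 means at least
    # 1 ham token for every 2 spam tokens.
    return ham_tokens >= ham_spam_ratio * spam_tokens


# Totals from the bayes_token query in this comment: only 137 ham
# tokens remain, so the check fails and Bayes would be skipped.
print(bayes_usable(34846107, 137))
```

With the totals reported here the minimum-ham check alone rejects the database. The ratio check would matter in the less extreme case where both sides are above their minimums but ham is still badly outnumbered.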
