http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4787





------- Additional Comments From [EMAIL PROTECTED]  2006-03-15 15:58 -------
Just had this happen again on another box, this one running with autolearn
thresholds of 0.1 and 12.0.

I believe the problem is that as ham tokens get expired and spam tokens are the
only thing left in bayes_token, BAYES_99 starts hitting on all email.  The
minimum of 200 spam and 200 ham messages learned is not a good thing to go by
(except on a brand new install).

I don't see how spam_count and ham_count in the bayes_vars table can accurately
represent your token distribution, so having a min spam and min ham token
count would be a better check.  As you can see, bayes_vars says it has learned
93k ham, but out of that, the ham counts left in bayes_token currently sum to
only 137.

mysql> select spam_count,ham_count,token_count from bayes_vars;
+------------+-----------+-------------+
| spam_count | ham_count | token_count |
+------------+-----------+-------------+
|    2463944 |     93579 |     3620764 |
+------------+-----------+-------------+

mysql> select sum(spam_count), sum(ham_count) from bayes_token;
+-----------------+----------------+
| sum(spam_count) | sum(ham_count) |
+-----------------+----------------+
|        34846107 |            137 |
+-----------------+----------------+
1 row in set (0.00 sec)

I propose adding the following bayes config options:

bayes_min_ham_tokens   <num>
bayes_min_spam_tokens  <num>

and maybe something extra on top of that, like a ham-to-spam token ratio:

bayes_ham_spam_token_ratio  0.5  # require 1 ham for every 2 spam tokens
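
A rough sketch of how a check built on these three options might behave, in
Python for brevity.  The names TokenGate and bayes_usable are made up for
illustration, and the thresholds of 200 used in the usage note are placeholders;
none of this exists in SpamAssassin.  A value of 0 is assumed to disable a
check.

```python
from dataclasses import dataclass


@dataclass
class TokenGate:
    """Hypothetical stand-ins for the proposed config options (0 = disabled)."""
    min_ham_tokens: int = 0             # bayes_min_ham_tokens
    min_spam_tokens: int = 0            # bayes_min_spam_tokens
    ham_spam_token_ratio: float = 0.0   # bayes_ham_spam_token_ratio


def bayes_usable(ham_tokens: int, spam_tokens: int, gate: TokenGate) -> bool:
    """Return True only if the token counts pass every enabled check."""
    if ham_tokens < gate.min_ham_tokens:
        return False
    if spam_tokens < gate.min_spam_tokens:
        return False
    # Ratio check: with 0.5, at least 1 ham is required for every 2 spam.
    if gate.ham_spam_token_ratio > 0 and spam_tokens > 0:
        if ham_tokens / spam_tokens < gate.ham_spam_token_ratio:
            return False
    return True
```

With TokenGate(200, 200, 0.5), the sums shown above (137 ham against 34846107
spam) fail the minimum-ham check, so Bayes would be skipped instead of letting
BAYES_99 fire on everything.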

Now I realize that running sum(spam_count) and sum(ham_count) over the
bayes_token table is a more expensive query, so I think the bayes_vars table
could gain a couple of columns after `token_count`, say `spam_token_count` and
`ham_token_count`, and auto-expiry/sa-learn could update those fields when it
runs.  That way the query to pull those counts remains efficient.  Also, if
those 2 fields are available, calculating the ham:spam ratio is a piece of cake.
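
A minimal sketch of that idea, using Python's sqlite3 module as a stand-in for
MySQL so it can be run anywhere.  The toy tables are heavily simplified (single
user, no id/username/atime columns), and the function names are hypothetical;
only the table and column names come from the proposal above.

```python
import sqlite3


def make_toy_db() -> sqlite3.Connection:
    """Build a simplified, single-user copy of the two Bayes tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bayes_vars (spam_count INTEGER, "
                 "ham_count INTEGER, token_count INTEGER)")
    conn.execute("CREATE TABLE bayes_token (token TEXT, "
                 "spam_count INTEGER, ham_count INTEGER)")
    conn.execute("INSERT INTO bayes_vars VALUES (10, 4, 3)")
    conn.executemany("INSERT INTO bayes_token VALUES (?, ?, ?)",
                     [("a", 5, 0), ("b", 7, 1), ("c", 3, 0)])
    return conn


def add_token_count_columns(conn: sqlite3.Connection) -> None:
    """One-time schema change: add the two proposed cached counters."""
    for col in ("spam_token_count", "ham_token_count"):
        conn.execute(f"ALTER TABLE bayes_vars ADD COLUMN {col} "
                     "INTEGER NOT NULL DEFAULT 0")


def refresh_token_counts(conn: sqlite3.Connection) -> tuple:
    """Run the expensive sums once (e.g. at expiry time) and cache them."""
    spam, ham = conn.execute(
        "SELECT COALESCE(SUM(spam_count), 0), COALESCE(SUM(ham_count), 0) "
        "FROM bayes_token").fetchone()
    conn.execute("UPDATE bayes_vars SET spam_token_count = ?, "
                 "ham_token_count = ?", (spam, ham))
    conn.commit()
    return spam, ham
```

At scan time only the single bayes_vars row would need to be read, and the
ham:spam ratio is just ham_token_count / spam_token_count (guarding against a
zero spam count).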

If this sounds like an acceptable solution, I may work on it, unless anyone
has reasons why it would not be better to do it this way?












------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
