On Feb 15, 2004, at 10:27 pm, Ralph Slooten wrote:
> On Sun, 15 Feb 2004 20:09:58 +0000 Stroller <[EMAIL PROTECTED]> wrote:
>> "only caught 92% of spam, with 1.16% false positives".
> So far I've been running it (bmf) for a week....
Note that the figure of 92% did *NOT* refer to bmf or Bogofilter, but to a filter written only for some tests done c. 1998.
> So far I've been running it (bmf) for a week (yes wow, lol, but hear
> me out ;-) ), and it's gotten 100% of the approx 70+ spams I've
> received. Initially 2 "good" mails got through, but it's simply a matter
> of reprocessing those 2 incorrect mails through bmf again, stating they
> are incorrectly detected as spam, and you're set..
That's not *actually* 100% then, is it..? It's about (77/79 × 100)% ≈ 97.5%. [1]
That seems to me a very good figure for a brand-new database - that's because you have wisely used a large corpus to train with. In terms of the general success rates of "Bayesian" filters, however, it's not particularly stunning. You will probably find that the next spam doesn't slip through for a week or two, and that your success rate climbs higher still.
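The arithmetic behind that estimate can be checked in a couple of lines (79 is only my stand-in for your "approx 70+", per footnote [1]):

```python
# Rough success rate: of an estimated 79 messages bmf filed as spam,
# 77 really were spam (2 "good" mails were initially misfiled).
caught, total = 77, 79
rate = caught / total * 100
print(f"{rate:.1f}%")  # -> 97.5%
```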
> For my "good" mail I fed it with most of my friends' mail ... and with everything from all the mailing lists I belong to from the last 2 weeks or so (these can be downloaded from their monthly archives if need be).
For me this would be unrepresentative, because all the messages I receive from mailing lists are filtered into folders based on headers such as "List-Post:" - there's no advantage in statistically classifying them.
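A sketch of that sort of header-based sorting, using Python's standard email module (the function name is my own invention, and "List-Post" is just one of the list headers you might key on):

```python
import email

def list_address(raw_message: bytes):
    """Return the posting address from a List-Post: header, or None.
    Messages sorted this way never need statistical classification."""
    msg = email.message_from_bytes(raw_message)
    header = msg.get("List-Post")
    if header is None:
        return None
    # A List-Post header usually looks like "<mailto:list@example.org>".
    return header.strip().strip("<>").removeprefix("mailto:")
```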
> For each mail caught as spam, the database automatically updates itself with any new contents of that mail, making it learn as it's catching mails.
I regard this as risky business - if you fail to reclassify any mistakes, then the filter will be more likely to make errors in future. You do state that bmf allows you to reclassify mistakes, but IMO it's better only to add spam/ham messages to the token database when the user specifically requests them. It's quick & easy enough to review a week's worth of messages in the suspected-spam folder & drag them to the confirmed-spam folder, and for me a false-positive is FAR more worrying than several false-negatives.
Once you have a large enough database (I think maybe 20,000+ messages), it's worth no longer training it on messages that the filter has caught correctly. If you train it only on messages that it has failed to classify correctly (called "training-on-error") those messages will have a stronger impact on the database, and lead to higher accuracy in future.
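A toy illustration of training-on-error, using a made-up token-count database (nothing here reflects bmf's or Bogofilter's real internals):

```python
from collections import Counter

# Hypothetical per-class token counts standing in for the filter's database.
spam_db, ham_db = Counter(), Counter()

def classify(text):
    """Crude score: the class whose tokens cover more of the message wins."""
    tokens = text.lower().split()
    spam_score = sum(spam_db[t] for t in tokens)
    ham_score = sum(ham_db[t] for t in tokens)
    return "spam" if spam_score > ham_score else "ham"

def train_on_error(text, true_label):
    """Update the database only when the classifier got this message wrong,
    so each correction carries full weight."""
    if classify(text) != true_label:
        db = spam_db if true_label == "spam" else ham_db
        db.update(text.lower().split())
```

Messages the filter already gets right leave the database untouched, which is exactly why the mistakes it does train on have a stronger impact.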
Stroller.
[1] I'm using 79 as an arbitrary figure to estimate your term "70+".
-- [EMAIL PROTECTED] mailing list
