On Sun, 15 Feb 2004, Stroller uttered the following immortal words,
>
> On Feb 15, 2004, at 10:27 pm, Ralph Slooten wrote:
>
> > On Sun, 15 Feb 2004 20:09:58 +0000
> > Stroller <[EMAIL PROTECTED]> wrote:
> >
> >> "only caught 92% of spam, with 1.16% false positives".
> >
> > So far I've been running it (bmf) for a week....
>
> Note that the figure of 92% did *NOT* refer to bmf or Bogofilter, but
> to a filter written only for some tests done c 1998.
Yes, we typically expect 99% from Bayesian filters.

> > For my "good" mail I fed it with most of my friends ...
> > and from all the mailing lists I belong to from this 2
> > weeks +-(can be downloaded from their monthly archives if need be).
>
> For me, this would be unrepresentative, because all the messages I
> receive from mailing lists are filtered into folders based on headers
> such as "List-Post:" - there's no advantage in statistically
> classifying them.

Also, I find that Bayesian filters have a problem when you feed them
list mail. Somehow mailing lists look suspiciously like spam to them
(because they contain a lot of unnecessary data?), so if you feed them
a large amount of mailing list data tagged as non-spam, the spam
detection rate drops a lot.

> > For each mail caught as spam, the database automatically updates itself
> > with any new contents of that mail, making it learn as it's catching
> > mails.
>
> I regard this as risky business - if you fail to reclassify any
> mistakes, then the filter will be more likely to make errors in future.
> You do state that bmf allows you to reclassify mistakes, but IMO it's
> better only to add spam/ham messages to the token database when the
> user specifically requests them.

Actually, I thought that a Bayesian filter is supposed to do exactly
the above, i.e. it learns and auto-updates its database, so that as
time goes by it gets better and better.

Grendel

-- 
Grendel's annoyance filter is so advanced it puts people to the
killfile even before they have posted.

-- 
[EMAIL PROTECTED] mailing list
