On Mon, 26 Aug 2002 11:30:30 +1000
  "Brett Handley" <[EMAIL PROTECTED]> wrote:

>Paul Graham quoted 4000 messages, I only worked with a couple of hundred
>good emails and 14 bad (all I've kept) so with such a low sample size it is
>likely that my tests of the filter will be suspect.

I've been playing around with Brett's code.

It took 17 mins to tokenise 770 sample spam messages and 
516 "good" messages (the email was first pulled from my 
email server and saved to local storage before starting 
the test).

I ended up with 34052 unique tokens from the good mail, 
and 60516 tokens from the spam.
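
For anyone who wants to reproduce the counting step outside Brett's script, here is a minimal Python sketch. It is not Brett's code. The token definition (runs of letters, digits, apostrophes, dashes and dollar signs, with all-digit tokens ignored) roughly follows Paul Graham's essay and is an assumption here; the names tokenise and count_tokens are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, digits, apostrophes, dashes or
# dollar signs; everything else separates tokens. Brett's script may differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenise(text):
    """Return the tokens of one message, skipping all-digit tokens."""
    return [t for t in TOKEN_RE.findall(text) if not t.isdigit()]


def count_tokens(messages):
    """Count token occurrences over an iterable of message strings."""
    counts = Counter()
    for message in messages:
        counts.update(tokenise(message))
    return counts


# Tiny made-up corpora, only to show the shape of the result.
good = count_tokens(["Meeting at 10 tomorrow", "Lunch tomorrow?"])
spam = count_tokens(["FREE money!!! Click here", "Free offer, click now"])
print(len(good), "unique good tokens;", len(spam), "unique spam tokens")
```

The length of each Counter is the number of unique tokens in that corpus, which is the kind of figure quoted above.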

I then ran a test on the same body of good and bad emails.
The script detected one of the "good" emails as being spam, 
and looking at that email, I found that I had 
misclassified that message as good whereas it was in fact 
spam!

The script only detected 604 of the 770 spam messages 
(about 78%) as being spam.
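
To make the scoring step concrete, here is a sketch in Python of the scheme described in Paul Graham's essay, on the assumption (not confirmed in this thread) that Brett's script follows it. The function names are hypothetical. The constants (good counts doubled, tokens seen fewer than 5 times ignored, probabilities clamped to 0.01-0.99, 0.4 for unknown tokens, the 15 most interesting tokens) are the ones from the essay as I remember them.

```python
def token_probability(g_count, b_count, ngood, nbad):
    """Spam probability of one token, or None if it was seen too rarely.

    g_count / b_count: occurrences in the good / spam corpus.
    ngood / nbad: number of good / spam messages (both must be > 0).
    """
    g = 2 * g_count  # good counts are doubled to bias against false positives
    b = b_count
    if g + b < 5:
        return None
    g_ratio = min(1.0, g / ngood)
    b_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, b_ratio / (g_ratio + b_ratio)))


def spam_probability(tokens, good, bad, ngood, nbad, top_n=15):
    """Combine the top_n most 'interesting' token probabilities."""
    probs = []
    for tok in set(tokens):
        p = token_probability(good.get(tok, 0), bad.get(tok, 0), ngood, nbad)
        probs.append(0.4 if p is None else p)  # 0.4 for unknown tokens
    # Most interesting means furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in probs[:top_n]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)
```

In the essay a message is treated as spam when the combined value is above 0.9. With good and bad being token-count tables for the two corpora, a call would look like spam_probability(tokens, good, bad, 516, 770).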

I suspect others will have better results than this.  My 
email is already heavily filtered - I have about 40 
filters running on my mail server, so the tests were run 
on messages that had got through the filters.  Also, a lot of 
what I consider spam actually looks like my good mail.

The only significant change I made to Brett's code was 
to strip out attachments before tokenising the message. 
However, I still need to decode base64-encoded text/html 
messages and tokenise them rather than discarding them 
as attachments.
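
As an illustration of that remaining step, here is a rough Python sketch (again, not Brett's code) that keeps the text parts of a message, including base64-encoded text/html, and drops everything else. It uses the standard library email package; the function name text_parts is hypothetical.

```python
import email
from email import policy


def text_parts(raw_bytes):
    """Yield (content_type, decoded_text) for each text/* part of a message.

    Non-text parts (images, archives and so on) are skipped rather than
    tokenised. A part with an unknown charset will raise an error here;
    a real filter would need to handle that.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # multipart containers and binary attachments
        # get_content() undoes base64 / quoted-printable and the charset.
        yield part.get_content_type(), part.get_content()
```

Each decoded body could then be passed to the tokeniser. For text/html parts it is probably worth deciding whether the markup itself should be tokenised or stripped first.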


--
Graham Chiu