HTML Mail/Spam Relationship

Scott at HobbyLink Japan Thu, 22 May 2003 03:07:57 +0200

>Scott at HobbyLink Japan on 5/22/03 said

>>I have no knowledge as to SpamSieve's inner workings, but almost all of
>>the spam it's missing these days are 100% HTML mails, hence my comment. 
>>The others involve Nigeria and large sums of money.
>
>SpamSieve uses Bayesian filtering, which uses every word of the email to
>build its corpus. Mr. Tsai has said that after a while you might have to
>back up the corpus.plist file, select and remove all words in the
>corpus window, and then retrain with new good mail and spam mail.
>He says: 
>
>"I did this in late January, and my accuracy
>increased from 91.5% to 98.6%, even though the new corpus only had
>about 1300 messages."


I did this, and yes, it did increase my accuracy.  I'm extremely happy
with the SpamSieve/PowerMail combination overall.  But...

Most of the mails it's missing seem, from a layman's standpoint, to be
complete no-brainers: HTML messages laced with porn words and links to
external images.  Perhaps 1 in 20 to 30 of the HTML mails I get are not
spam, so I'd like the option, either in PM itself (can this be done with
a filter?  I don't know how, since body filtering is not provided) or in
SpamSieve, to adopt a "guilty until proven innocent" policy regarding
HTML mail, especially mail with links to images.
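
For illustration only, here is a rough sketch of what such a "guilty until
proven innocent" rule could look like as a standalone Python script. It is
not a PowerMail or SpamSieve feature. The names is_suspect and
trusted_senders are hypothetical, and the trusted-sender list is an assumed
way of letting the few legitimate HTML mails through.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Matches an <img> tag whose src points at a remote http(s) URL.
REMOTE_IMG = re.compile(r'<img\b[^>]*\bsrc\s*=\s*["\']?https?://', re.IGNORECASE)


def html_parts(msg):
    """Yield the decoded text of every text/html part in the message."""
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True) or b""
            # Assumption: fall back to latin-1 when no charset is declared.
            charset = part.get_content_charset() or "latin-1"
            try:
                yield payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                yield payload.decode("latin-1", errors="replace")


def is_suspect(raw_message, trusted_senders=frozenset()):
    """Treat HTML mail with remote images as spam unless the sender is trusted.

    trusted_senders is a hypothetical whitelist of lowercase addresses.
    """
    msg = message_from_string(raw_message)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in trusted_senders:
        return False  # "proven innocent"
    return any(REMOTE_IMG.search(html) for html in html_parts(msg))
```

A rule like this would run before the Bayesian filter, which would then only
have to judge plain-text mail and HTML mail from unknown senders that
carries no remote images.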

How hard can this be?

And why does Nigeria mail still get through, even though I have trained
every darn one of them as spam?  One would think by now that the word
'Nigeria' would alone almost be an automatic trigger, but I don't know
exactly how these Bayesian algorithms work. 
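
As a rough illustration of why one word is rarely decisive, here is a
minimal sketch of the combining rule from Paul Graham's "A Plan for Spam",
simplified. Whether SpamSieve uses this exact formula is an assumption, and
the function names and sample probabilities are made up for the example.
Each word gets a spam probability from how often it appears in the spam and
good corpora, and the per-word probabilities are then multiplied together,
so several words typical of good mail can outvote a single strong spam word.

```python
from math import prod


def word_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | word) from corpus counts (n_spam, n_ham must be > 0)."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # never-seen word: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single word can be treated as absolute proof.
    return min(0.99, max(0.01, p))


def combined_spam_probability(word_probs):
    """Combine per-word probabilities, treating the words as independent."""
    spam_evidence = prod(word_probs)
    ham_evidence = prod(1.0 - p for p in word_probs)
    return spam_evidence / (spam_evidence + ham_evidence)


# 'Nigeria' seen only in spam, plus four words that look like ordinary
# business mail (made-up probabilities of 0.2 each).
nigeria = word_spam_probability(spam_hits=50, ham_hits=0, n_spam=500, n_ham=500)
score = combined_spam_probability([nigeria, 0.2, 0.2, 0.2, 0.2])
print(f"P(spam) = {score:.2f}")
```

With these made-up numbers the combined score falls below 0.5 even though
'Nigeria' alone scores 0.99. That is one plausible reason such letters,
written in sober business prose, can slip through a filter of this kind.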

---

Scott T. Hards
President
HobbyLink Japan (www.hlj.com)
