On Feb 25, 2004, at 8:46 AM, Sonja Monsen Ray wrote:


My understanding (and someone please correct me if I'm wrong) is that Mail uses something called "Bayesian filtering" which is supposed to be capable of "learning," over time, what is junk and what isn't. I keep seeing references to Bayesian filtering as being the best overall type of spam filtering available -- it's not just a static list of words and phrases that commonly appear in spam; it's something at least resembling artificial intelligence in how it learns what to designate as spam.

I believe that emptying your junk mail box won't affect its educational process at all.

Googling up "Bayesian filtering" will probably get you more information than you ever wanted to know about how it works. I don't understand it at all.

Here's a somewhat technical description by the guy who popularized the technique for spam filtering: <http://www.paulgraham.com/spam.html>


Basically, though, what this sort of filtering does is build up databases of 'words' (which include things like html commands and such) found in the headers and bodies of e-mails, and sort them into two piles, traditionally called 'spam' for, well, spam, and 'ham' for good messages.

This counts on the fact that spam e-mails want to sell you something, no matter how they word things, and so a larger number of words from the 'spam' pile will appear in them. (Note: some of the html tricks spammers play are also easily recognized.)

In essence, what this sort of technique does is quantify our own ability to look at two messages, one spam and the other not, and instantly see which is which.

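Each incoming message gets a score based on how strongly its words lean toward the 'spam' pile or the 'ham' pile, and once that score is developed, the message is classified as spam or not.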

Now you train the system by designating missed spam messages as 'Junk', or false positives as 'Not Junk', which helps the system tune the recognition threshold and databases for *your* e-mail mix.
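
To make the train-then-score loop a bit more concrete, here is a minimal sketch in Python. It is a plain naive Bayes word counter with add-one smoothing, not Mail's actual code and not exactly the method in the essay linked above. The class name TinyBayesFilter, the tokenizer, and the 0.9 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy spam/ham word counter. Illustrative only, not Mail's implementation."""

    def __init__(self, threshold=0.9):
        self.spam_words = Counter()  # word -> number of spam messages containing it
        self.ham_words = Counter()   # word -> number of ham messages containing it
        self.spam_total = 0          # spam messages seen so far
        self.ham_total = 0           # ham messages seen so far
        self.threshold = threshold   # assumed cutoff for calling a message junk

    @staticmethod
    def tokenize(text):
        # Keep html-ish characters so tags count as 'words' too.
        return set(re.findall(r"[a-z0-9$'<>/-]+", text.lower()))

    def train(self, text, is_spam):
        """What clicking 'Junk' (is_spam=True) or 'Not Junk' (False) would do."""
        words = self.tokenize(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_total += 1
        else:
            self.ham_words.update(words)
            self.ham_total += 1

    def score(self, text):
        """Return an estimated probability (0..1) that the message is spam."""
        # Start from the prior odds, then add each word's evidence in log space.
        log_odds = math.log((self.spam_total + 1) / (self.ham_total + 1))
        for word in self.tokenize(text):
            p_word_spam = (self.spam_words[word] + 1) / (self.spam_total + 2)
            p_word_ham = (self.ham_words[word] + 1) / (self.ham_total + 2)
            log_odds += math.log(p_word_spam / p_word_ham)
        # Numerically stable conversion from log-odds to a probability.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)

    def is_junk(self, text):
        return self.score(text) >= self.threshold
```

Calling train(message, True) stands in for marking a message 'Junk', and train(message, False) for 'Not Junk'. Because the counts come entirely from your own corrections, the scores drift toward *your* e-mail mix: a message that was wrongly flagged scores lower after you train it as ham.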

We know this works, because the spammers have taken to poison pill tactics. I'm sure you've noticed spam mails with tons of random words in them. What the spammers are trying to do here is twofold:

1) By including a lot of non-spammish words in the message, they hope to lower the 'spam' content of the message (mostly just an html link, these days) so that it evades the filter.

2) If you go ahead and mark it as spam anyway, they hope to 'poison' the spam/ham databases so that they start generating enough false positives to make the system less and less useful, requiring you to continually pore through your declared spam for good messages that were falsely marked as junk.

(Like one person who e-mailed me for help with some systems I run. I only happened to catch her e-mail in the Junk folder. It was declared junk because she'd found some stupid doodad for Outlook that put in pretty html backgrounds and dancing icons, and junk like that. She e-mails me in plain text now...)
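
The first tactic is easy to see with numbers. The sketch below uses the combining formula from the essay linked above, prod(p) / (prod(p) + prod(1 - p)), but applies it to every word rather than only the most telling ones, to keep things short. The per-word probabilities are invented for illustration; only the 0.4 default for unknown words comes from that essay.

```python
from math import prod

# Hypothetical per-word spam probabilities a trained filter might hold.
WORD_SPAM_PROB = {
    "herbal": 0.95, "viagra": 0.99, "click": 0.90,
    "dorothy": 0.10, "scarecrow": 0.10, "kansas": 0.15, "wizard": 0.20,
}
UNKNOWN_WORD_PROB = 0.4  # default for words the filter has never seen


def combined_spam_probability(words):
    """Combine per-word probabilities into one score for the whole message."""
    probs = [WORD_SPAM_PROB.get(w, UNKNOWN_WORD_PROB) for w in words]
    spam_evidence = prod(probs)
    ham_evidence = prod(1 - p for p in probs)
    return spam_evidence / (spam_evidence + ham_evidence)


bare_ad = ["herbal", "viagra", "click"]
padded_ad = bare_ad + ["dorothy", "scarecrow", "kansas", "wizard"]

bare_score = combined_spam_probability(bare_ad)
padded_score = combined_spam_probability(padded_ad)
print(f"bare ad:   {bare_score:.4f}")
print(f"padded ad: {padded_score:.4f}")
```

The padded ad scores lower than the bare one, which is the whole point of the padding. The second tactic follows from the same table: every time you mark a padded message as spam, the probabilities for the innocent padding words creep upward.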

The latest twist of the spammers is to include passages of real text, from out-of-copyright sources. People have reported getting several paragraphs of 'The Wizard of Oz' in their herbal viagra ads.

This is a serious problem. Some estimates (admittedly high) put spam as comprising 80% of all e-mail by the end of the year. Lots of people have pretty much stopped using e-mail to communicate because of it.

On top of that, cures are being proposed that are worse than the disease, such as Bill Gates suggesting that charging for e-mail will help reduce spam (like it's stopped junk snail mail, right?). Of course, who is looking to collect all those fees, hmm? Moreover, national laws do nothing to stop this; the internet does not respect borders. A Do-Not-Call-style registry won't work for e-mail.

The true solution is to implement IPv6, which is the next generation of the IP layer underneath TCP/IP, but that requires, pretty much, dismantling the existing internet and rebuilding it on a new foundation.

This solves a number of problems, both from the standpoint of security (it's meant to be far more difficult to hide who you are in IPv6, making it simpler to track and filter spammers and virus spreaders. Also, support for IPsec encryption is built into the protocol, which can enhance the privacy of what we send) and usability (it extends, by mind-boggling amounts, the number of available IP addresses...with IPv6, we could assign a separate IP address for *every* man-made object on the planet, from the Great Wall of China down to *each* button on your shirt.)
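
For a sense of what "mind-boggling" means here, a quick back-of-the-envelope check in Python using the standard ipaddress module. The world population figure is a rough assumption for illustration, not a number from any source above.

```python
import ipaddress

# Total addresses in each protocol's full address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

WORLD_POPULATION = 6_400_000_000  # rough figure, assumed for illustration

addresses_per_person = ipv6_total // WORLD_POPULATION

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:,}")
print(f"IPv6 addresses per person: {addresses_per_person:.3e}")
```

IPv4 tops out at 2**32 addresses, fewer than one per person, while IPv6 has 2**128, which leaves plenty for every shirt button.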

My personal feeling is tracking some of these people down and kneecapping them might be lenient.

--
Bruce Johnson
University of Arizona
College of Pharmacy
Information Technology Group

Institutions do not have opinions, merely customs

