On Sat, 1 Feb 2003, Sam Ewalt wrote:
>
> On Thu, 30 Jan 2003 Steve <[EMAIL PROTECTED]> writes:
>
>> If you quote spam, your entire e-mail is processed as
>> spam, i.e., it *becomes* spam.
>
> Would it be more accurate to say that quoting spam would make
> the entire email "appear" as if it was spam to anti-spam
> filtering programs that it might encounter? And only then if
> it triggers whatever filters might be set?

Reparse "...processed as spam..." and I think you'll find it
means exactly that.

> To become truly useful the anti-spam programs need to become
> more sophisticated and be able to discern the difference between
> spam and legitimate messages that might quote spam. Of course
> some people might want to eliminate anything that might even smell
> of spam or look like spam even if it's really a gourmet pate.
>
> But I'd want a spam filter to be able to make fine discriminations
> and not take a broad approach like blocking domains or blocking
> any mention of "Nigeria" or "breasts" or whatever.

Procmail uses a simple conditional check. If the body has the
phrase "[Bb]reast [Ee]nlargement", do something with the message.
If the headers contain "From:.*caramail.com" (where .* is a wildcard
for any number of characters) then do something (else). If headers
contain "Sender: [EMAIL PROTECTED]" then the message goes to the
arachne folder. Procmail is very useful for "check for this
condition... if <foo> is encountered, do <bar>." Though somewhat
simplistic, it's very effective.

It's also somewhat time-consuming, in that you have to create a new
rule for every spam that makes it through your ever-growing number
of filters... and then what happens when you've blocked the IP
address for a spammer haven, it goes out of business, and a year
later the address belongs to an ISP you *want* to get mail from?
Not only do you have to constantly make new rules for new spammer
strategies, but you also have to figure out a way to track and
expire rules that are no longer useful.

In the example above, I got spam from caramail and reported it to
their published abuse address. When nothing was done and I continued
getting spam from caramail.com, I added a rule to send all mail from
caramail straight to /dev/null. I probably don't have to worry that
someone will come along and buy the domain name... and then become
an upstanding ISP.
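
To make the "if <foo> is encountered, do <bar>" idea concrete, here is
a rough Python sketch of first-match rule checking. It is not how
procmail is implemented; the function name, the rule table layout, and
the list address (a placeholder, since the real one is redacted above)
are all assumptions made for illustration.

```python
import re

# Each rule: (part of the message to search, regex, destination).
# Rules are tried in order and the first match wins.
RULES = [
    ("body", r"[Bb]reast [Ee]nlargement", "/dev/null"),
    ("headers", r"^From:.*caramail\.com", "/dev/null"),
    # Placeholder address: the real list address is redacted in the post.
    ("headers", r"^Sender: arachne-list@example\.invalid", "arachne"),
]


def deliver(raw_message, rules=RULES, default="inbox"):
    """Return the destination chosen by the first matching rule."""
    # Headers and body are separated by the first blank line.
    headers, _, body = raw_message.partition("\n\n")
    for where, pattern, destination in rules:
        text = body if where == "body" else headers
        if re.search(pattern, text, re.MULTILINE):
            return destination
    return default
```

Because the first match wins, the order of the table decides what
happens to a list message that quotes spam, and every new spammer trick
means another row somebody has to add (and eventually expire).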
The rule is effective, and even if caramail eventually expires, odds
are that leaving the rule intact will do no harm.

As to a spam embedded within a list e-mail... sure, I could put the
arachne rule before the spam rules, which would send every message
from the arachne list to the arachne folder. Matter of fact, I used
to have it arranged that way... but my patience with spam is thin,
so it only took a few showing up on the list to make me reverse the
order of those rules. Now, mail has to pass the filters even if it
is from the arachne list.

Procmail wasn't designed as a spam filter per se (though it certainly
functions as one). As a matter of fact, one of the things it can do
is run mailing lists, and it could easily replace majordomo at
arachne.cz.

SpamAssassin evaluates mail on a cumulative basis rather than a
conditional one. It assigns a certain number of points for each
potential spam signal (configurable, so you can change values to
better reflect your situation), and you then set the threshold, x,
under which mail should go to your inbox. Between x and y, send it
to a special spam folder; over y, go directly to /dev/null. Or you
can have a single x: messages under x go to your inbox, and messages
over x go to either a spam folder or /dev/null, whichever you prefer.

So, just having the phrase "breast enlargement" isn't, by itself,
enough to keep an e-mail out of your inbox. If the e-mail had that,
plus a message-id generated by your local MTA, exclamation points in
the subject, and a line or two of ALL CAPS, then it would likely go
over any reasonable threshold you'd set for non-spam.

Still, a message that had the entire text of a Nigerian scam would
be processed as spam. The cumulative signals in the spam would total
more than your threshold regardless of additional text being present.

Right now, I use procmail spam rules that send mail to /dev/null
first. These are mainly IP blocks that have shown no interest in
controlling spam through their systems... a blacklist if you will.
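
The point-scoring approach described above can be sketched roughly as
follows. The signal patterns and point values are invented for
illustration and are not SpamAssassin's actual rules or scores; the
function names and the default thresholds x and y are assumptions too.

```python
import re

# Hypothetical signals and point values, NOT SpamAssassin's real ones.
SIGNALS = [
    (r"(?i)breast enlargement", 2.0),
    (r"(?m)^Subject:.*!", 1.5),             # exclamation point in subject
    (r"(?m)^[A-Z][A-Z !,.']{20,}$", 1.5),   # a whole line in ALL CAPS
    (r"(?i)nigeria", 1.0),
]


def score(raw_message):
    """Add up the points of every signal found in the message."""
    return sum(points for pattern, points in SIGNALS
               if re.search(pattern, raw_message))


def route(raw_message, x=3.0, y=6.0):
    """Under x: inbox. From x to y: spam folder. Over y: /dev/null."""
    total = score(raw_message)
    if total < x:
        return "inbox"
    if total <= y:
        return "spam-folder"
    return "/dev/null"
```

With these made-up numbers, a message that merely mentions "breast
enlargement" scores 2.0 and still reaches the inbox, while one that
quotes a whole spam picks up several signals at once and crosses the
threshold no matter what else is written around it.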
Then come the SpamAssassin tests. Then some more procmail recipes.
Pretty soon I'll also add in a Bayesian/heuristic type filter.

Now we get to the newer stuff. Most filters that use Bayesian
analysis also use heuristics. The Bayesian aspect is a way of
determining the statistical probability that a message is spam based
on a database of tokens built up from your training set. You tell
the filter, "This one is spam, this one isn't. This is spam, this is
spam, this is spam, this isn't." The filter learns which words and
combinations of words are statistically likely to indicate spam.
Generally, after the first day of training, you're at 97% accuracy.
One more day takes you to 98%, and one more day takes you to 99%.
After that, you only need to train on errors. Within about a week,
it's not uncommon to hit 99.8% accuracy on recognizing the
difference between spam and ham. (That seems to be the up-and-coming
term for non-spam.)

With this type of filter, it would be possible to recognize a
paragraph or group of text within an e-mail as spam, and another
group as ham. At that point, you could teach it to pass that type of
message on to you. Unfortunately, the spammers already thought of
that angle, and many are now including 10K worth of literature along
with 2K of spam in an attempt to circumvent probability filters set
up as you've suggested.

It would seem that the Bayesian/heuristic spam filter might be the
last one you'll ever need. Ostensibly, once the spammers come up
with a new strategy, you simply train your filters on it. Much as
I'd like to believe that, I know that under the current economics of
e-mail, spammers are making a lot of money for little risk. As long
as the monetary incentive is there, spammers will continue to figure
ways around filters, even filters with the ability to learn.
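
For anyone curious what "training" amounts to, here is a toy Python
sketch of the token-probability idea, written as a plain naive Bayes
classifier with add-one smoothing. It is an assumption-laden
illustration, not how bogofilter, crm114, or POPFile actually work; the
class name, the tokenizer, and the smoothing choice are all mine.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy token-based spam/ham classifier (illustrative only)."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """label is 'spam' or 'ham'; this is the 'this one is spam' step."""
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = set(self.tokens["spam"]) | set(self.tokens["ham"])
        total_messages = sum(self.messages.values())
        log_p = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common is this class in the training set?
            lp = math.log((self.messages[label] + 1) / (total_messages + 2))
            total = sum(self.tokens[label].values())
            for tok in self.tokenize(text):
                # Add-one smoothing so unseen tokens never give log(0).
                count = self.tokens[label][tok]
                lp += math.log((count + 1) / (total + len(vocab) + 1))
            log_p[label] = lp
        # Normalise in a way that cannot overflow.
        top = max(log_p.values())
        spam = math.exp(log_p["spam"] - top)
        ham = math.exp(log_p["ham"] - top)
        return spam / (spam + ham)
```

Train it on a handful of labelled messages and the probability for
spammy text climbs while ordinary mail stays low. It also shows why the
"10K of literature" trick works: padding a spam with ham-looking words
drags its probability down.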

I'm looking at bogofilter and crm114 right now (Windoze users might
want to look at POPfile), and once I'm familiar with them, I'll
decide whether I want to add one to what I'm already doing, or maybe
replace SpamAssassin altogether. Who knows... maybe it *will* be the
last filter I ever have to install.

P.S. I just added one more small IP block from interbusiness.it for
procmail to send straight to /dev/null. First time I've seen a spam
from them since October.

# interbusiness.it 2003/02
# Covers 217.56.66.160 through 217.56.66.167. A bracket expression
# matches single characters, so the range is written as 16[0-7].
:0
* ^Received: from.*\[217\.56\.66\.16[0-7]\]
/dev/null

That's the 8th IP block of theirs I've /dev/null'd.

--
Steve Ackman
http://twoloonscoffee.com (Need green beans?)
http://twovoyagers.com (glass, linux & other stuff)
