On Sat, 1 Feb 2003, Sam Ewalt wrote:

> > On Thu, 30 Jan 2003 Steve <[EMAIL PROTECTED]> writes:
> >>   If you quote spam, your entire e-mail is processed as
> >> spam, i.e., it *becomes* spam.
> 
> Would it be more accurate to say that quoting spam would make
> the entire email "appear" as if it was spam to anti-spam
> filtering programs that it might encounter?  And only then if
> it triggers whatever filters might be set?

  Reparse "...processed as spam..." and I think you'll find 
it means exactly that.
 
> To become truly useful the anti-spam programs need to become
> more sophisticated and be able to discern the difference between
> spam and legitimate messages that might quote spam. Of course
> some people might want to eliminate anything that might even smell
> of spam or look like spam even if it's really a gourmet pate.
>
> But I'd want a spam filter to be able to make fine discriminations
> and not take broad approach like blocking domains or blocking
> any mention of "Nigeria" or "breasts"  or whatever.

  Procmail uses a simple conditional check.  If the body has
the phrase "[Bb]reast [Ee]nlargement", do something with the 
message.  If the headers contain "From:.*caramail.com" 
(where .* is a wildcard for any number of characters), then 
do something (else).  
  If headers contain "Sender: [EMAIL PROTECTED]" 
then the message goes to the arachne folder. 

  Procmail is very useful for "check for this condition... 
if <foo> is encountered, do <bar>."
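
  To make the "if <foo>, do <bar>" idea concrete, here is a 
rough Python sketch of first-match filtering.  It is not 
procmail syntax and not how procmail is implemented; the 
RULES list, the folder names, and the deliver() function are 
made up for illustration, and the "arachne" pattern is only a 
stand-in for the list address that the archive has elided.

```python
import re

# Hypothetical rules: (part of the message to check, pattern, destination).
# Rules are tried top to bottom and the first match wins.
RULES = [
    ("body",    r"[Bb]reast [Ee]nlargement", "spam"),
    ("headers", r"^From:.*caramail\.com",    "/dev/null"),
    ("headers", r"^Sender:.*arachne",        "arachne"),  # stand-in pattern
]

def deliver(headers: str, body: str) -> str:
    """Return the destination of the first matching rule, else 'inbox'."""
    parts = {"headers": headers, "body": body}
    for where, pattern, destination in RULES:
        # re.MULTILINE lets ^ match at the start of each header line.
        if re.search(pattern, parts[where], re.MULTILINE):
            return destination
    return "inbox"
```

  For example, deliver("From: someone@caramail.com", "hello") 
falls through the first rule and stops at the second.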

  Though somewhat simplistic, it's very effective.  It's 
also somewhat time-consuming, in that you have to create a 
new rule for every spam that makes it through your 
ever-growing number of filters.  And what happens when 
you've blocked the IP address of a spammer haven, it goes 
out of business, and a year later the address belongs to an 
ISP you *want* to get mail from?  Not only do you have to 
constantly make new rules for new spammer strategies, but 
you also have to figure out a way to track and expire rules 
that are no longer useful.
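
  One possible way to track rule age, sketched in Python: 
assume every recipe is preceded by a comment of the form 
"# name YYYY/MM" and list the ones older than some cutoff.  
The comment convention, the stale_rules() function, and the 
12-month default are assumptions for illustration, not 
anything procmail provides.

```python
import re
from datetime import date

def stale_rules(procmailrc_text: str, today: date, max_age_months: int = 12):
    """Return (name, 'YYYY/MM') for dated comments older than max_age_months."""
    stale = []
    # Look for comment lines such as "# example.net 2002/01".
    pattern = r"^#\s*(\S+)\s+(\d{4})/(\d{2})[ \t]*$"
    for m in re.finditer(pattern, procmailrc_text, re.MULTILINE):
        name, year, month = m.group(1), int(m.group(2)), int(m.group(3))
        age_months = (today.year - year) * 12 + (today.month - month)
        if age_months > max_age_months:
            stale.append((name, f"{year}/{month:02d}"))
    return stale
```

  This only flags candidates for review; deciding whether a 
block has actually changed hands is still a manual job.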

  In the example above, I got spam from caramail, reported 
it to their published abuse address, and when nothing was 
done, and I continued getting spam from caramail.com, I 
added a rule to send all mail from caramail straight to 
/dev/null.  I probably don't have to worry that someone 
will come along and buy the domain name... and then become 
an upstanding ISP.  The rule is effective, and even if 
caramail eventually expires, odds are that leaving the rule 
intact will do no harm.

  As to a spam embedded within a list e-mail... sure, I 
could put the arachne rule before the spam rules, which 
would send every message from the arachne list to the 
arachne folder.   Matter of fact, I used to have it arranged 
that way... but my patience with spam is thin, so it only 
took a few showing up on the list to make me reverse the 
order of those rules.  Now, mail has to pass the filters 
even if it is from the arachne list.
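
  The effect of rule order can be shown with a small Python 
sketch.  The two patterns and the first_match() helper are 
hypothetical stand-ins for the real recipes, which aren't 
reproduced here.

```python
import re

def first_match(rules, message):
    """Return the folder of the first rule whose pattern matches, else 'inbox'."""
    for pattern, folder in rules:
        if re.search(pattern, message, re.MULTILINE):
            return folder
    return "inbox"

# Hypothetical patterns standing in for the list rule and one spam rule.
LIST_RULE = (r"^Sender:.*arachne", "arachne")
SPAM_RULE = (r"[Bb]reast [Ee]nlargement", "/dev/null")

# A list post that quotes a spam in its body.
msg = "Sender: arachne-list\nSubject: look at this junk\n\n> Breast enlargement!!!"

list_first = first_match([LIST_RULE, SPAM_RULE], msg)  # list rule wins
spam_first = first_match([SPAM_RULE, LIST_RULE], msg)  # spam rule wins
```

  With the list rule first, the quoted spam lands in the 
arachne folder; with the spam rule first, the same message 
is discarded.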

  Procmail wasn't designed as a spam filter per se (though 
it certainly functions as one).  As a matter of fact, one of 
the things it can do is run mailing lists, and it could 
easily replace majordomo at arachne.cz.


  SpamAssassin evaluates mail on a cumulative basis rather 
than a conditional one.  It assigns a certain number of 
points for each potential spam signal (configurable, so you 
can change values to better reflect your situation), and you 
then set the threshold, x, under which mail should go to 
your inbox.  Between x and y, send it to a special spam 
folder; over y, go directly to /dev/null.
  Or you can have a single x.  Messages under x go to your 
inbox.  Messages over x go to either a spam folder or 
/dev/null, whichever you prefer.

  So, just having the phrase "breast enlargement" isn't, by 
itself, enough to keep an e-mail out of your inbox.  If the 
e-mail had that, plus a message-id generated by your local 
MTA, exclamation points in the subject, and a line or two of 
ALL CAPS, then it would likely go over any reasonable 
threshold you'd set for non-spam.
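
  Here is a minimal Python sketch of cumulative scoring with 
two thresholds.  The signals, point values, and the x/y 
defaults are invented for illustration; they are not 
SpamAssassin's actual tests or scores, and the choice to 
treat a score of exactly x or y as "over" is an assumption.

```python
import re

# Made-up signals and points, not SpamAssassin's real rule set.
SIGNALS = [
    (r"(?i)breast enlargement",     2.0),  # suspicious phrase, any case
    (r"^Subject:.*!",               1.5),  # exclamation point in the subject
    (r"^[A-Z][A-Z !,.']{19,}$",     1.5),  # a whole line of ALL CAPS
]

def score(message: str) -> float:
    """Add up the points of every signal found in the message."""
    total = 0.0
    for pattern, points in SIGNALS:
        if re.search(pattern, message, re.MULTILINE):
            total += points
    return total

def route(points: float, x: float = 3.0, y: float = 6.0) -> str:
    """Under x: inbox.  From x up to y: spam folder.  y and over: discard."""
    if points < x:
        return "inbox"
    if points < y:
        return "spam-folder"
    return "/dev/null"
```

  A message that merely mentions the phrase scores 2.0 here 
and stays in the inbox; add a shouting subject and a line of 
capitals and it crosses x.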

  Still, a message that had the entire text of a Nigerian 
scam would be processed as spam.  The cumulative signals in 
the spam would total more than your threshold regardless of 
additional text being present.

  Right now, I use procmail spam rules that send mail to 
/dev/null first.  These are mainly IP blocks that have shown 
no interest in controlling spam through their systems... a 
blacklist if you will.  Then come the spamassassin tests.  
Then some more procmail recipes.  Pretty soon I'll also add 
in a Bayesian/heuristic type filter.  

  Now we get to the newer stuff.  Most filters using 
Bayesian analysis also use heuristics.  The Bayesian aspect 
is a way of estimating the probability that a message is 
spam based on a database of tokens built up from your 
training set.  You tell the filter, "This one is spam, this 
one isn't.  This is spam, this is spam, this is spam, this 
isn't."  The filter learns which words and combinations of 
words are statistically likely to show up in spam.  
Generally, after the first day of training, you're at 97% 
accuracy.  One more day takes you to 98%, and one more day 
takes you to 99%.  After that, you only need to train on 
errors.  Within about a week, it's not uncommon to hit 99.8% 
accuracy on recognizing the difference between spam and ham.
(That seems to be the up and coming term for non-spam.)
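
  The training idea can be sketched as a toy naive Bayes 
classifier in Python.  This is a simplified illustration 
with add-one smoothing, not the algorithm used by any 
particular filter; the TinyBayes class and its method names 
are hypothetical.

```python
import math
import re
from collections import Counter

class TinyBayes:
    """Toy naive Bayes spam/ham classifier over lowercase word tokens."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record the tokens of one message labeled 'spam' or 'ham'."""
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) from the token counts seen so far."""
        spam, ham = self.tokens["spam"], self.tokens["ham"]
        # +1 reserves a slot for unseen tokens and avoids division by zero.
        vocab = len(set(spam) | set(ham)) + 1
        spam_total = sum(spam.values()) + vocab
        ham_total = sum(ham.values()) + vocab
        # Start from the (smoothed) ratio of spam to ham messages.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (spam[tok] + 1) / spam_total
            p_ham = (ham[tok] + 1) / ham_total
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))
```

  Usage is just a series of train(text, "spam") and 
train(text, "ham") calls, followed by spam_probability(text) 
on new mail and a cutoff of your choosing.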

  With this type of filter, it would be possible to 
recognize a paragraph or group of text within an e-mail as 
spam, and another group as ham.  At that point, you could 
teach it to pass that type of message on to you.  
Unfortunately, the spammers already thought of that angle, 
and many are now including 10K worth of literature along 
with 2K of spam in an attempt to circumvent probability 
filters set up as you've suggested.

  It would seem that the Bayesian/heuristic spam filter 
might be the last one you'll ever need.  Ostensibly, once 
the spammers come up with a new strategy, you simply train 
your filters on it.  Much as I'd like to believe that, I 
know that under the current economics of e-mail, spammers 
are making a lot of money for little risk.  As long as the 
monetary incentive is there, spammers will continue to 
figure ways around filters, even filters with the ability 
to learn.

  I'm looking at bogofilter and crm114 right now (Windoze 
users might want to look at POPfile), and once I'm familiar 
with them, I'll decide whether I want to add one to what I'm 
already doing, or maybe replace spamassassin altogether.   
Who knows... maybe it *will* be the last filter I ever have 
to install.  


  P.S.  I just added one more small IP block from 
interbusiness.it for procmail to send straight to /dev/null.
First time I've seen a spam from them since October.

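# interbusiness.it 2003/02
# Covers 217.56.66.160 through 217.56.66.167: 16[0-7] matches the
# last octet, and the dots are escaped so they match literally.
:0
* ^Received: from.*\[217\.56\.66\.16[0-7]\]
/dev/null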

  That's the 8th IP block of theirs I've /dev/null'd.
  
-- 
Steve Ackman
http://twoloonscoffee.com       (Need green beans?)
http://twovoyagers.com          (glass, linux & other stuff)



