Andy Schmidt wrote:
> Hey Matt:
> One question - I know that you have been spending a lot of time programming
> content filters.
> I'm curious whether you are using Sniffer and whether you found that you
> needed all those filters to improve detection over Sniffer rules (which then
> makes me wonder why they are not made part of Sniffer) - or whether you are
> trying to substitute Sniffer?

I'm not trying to replace Sniffer, but I see no reason to be heavily
dependent on it either. Sniffer is a critical component on our system
and it hits 94% to 97% of the messages that we block on a daily basis.
The result on pure spam is probably a bit higher, but for instance we
are blocking about 2% of our volume as Joe-Job bounces, and there are
other things that get blocked that aren't technically spam but are
garbage. While Sniffer does hit on much of this stuff, it does so in
lower numbers.
I consider Sniffer primarily to be my substitute for content filtering.
Instead of tagging the wording, it primarily tags the links (with some
exceptions of course). When combined with other filters, it is much
more powerful than either alone, and the same goes for our custom
filters. For instance, if we get a DUL hit plus a Sniffer hit, the
confidence in it being spam goes up, and we add extra points for that
condition as well as many others. This also allows us to lower the
scores on both Sniffer and DUL hits (and others), because combination
filters are like multipliers and they often hit in combination. At the
same time, however, we were finding that a good deal of obvious DUL stuff
wasn't hitting on the DNSBLs that we use, so we started creating our own
DUL filters based on reverse DNS entries using the new NOTCONTAINS
functionality (required for this sort of work). We are now tagging 20%
more DUL hits as a result, and doing it more reliably than before in
fact (we defeat the filter when IPNOTINMX is not hit, meaning that an MX
record has been created for the domain to point to that DUL space, thus
allowing servers from such space to connect without punishment). I
actually consider most of my filters to be technical heuristics
instead of content filters, because in almost all of them I'm looking
for patterns and not words or phrases.
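
A minimal Python sketch of the logic described above follows. It is an illustration only, not the filter configuration used on this system: the weights, the bonus value, and the hostname patterns are all assumptions made up for the example. It shows a reverse-DNS DUL heuristic that is defeated when the sending IP appears in the domain's MX records, and a scorer that adds extra points when Sniffer and DUL hit together.

```python
import re

# Hypothetical weights and combination bonus; the real values are not given.
WEIGHTS = {"SNIFFER": 7, "DUL": 4}
COMBO_BONUS = {frozenset({"SNIFFER", "DUL"}): 5}

# Hypothetical tokens that suggest dynamically assigned address space.
DYNAMIC_RDNS = re.compile(
    r"(^|[.\-])(adsl|dsl|dial|dialup|dhcp|dyn|dynamic|cable|ppp|pool)[.\-\d]",
    re.IGNORECASE,
)
# Hypothetical tokens that suggest a real mail host. This plays the role of
# a "does not contain" exclusion on the reverse DNS name.
MAIL_HOST = re.compile(r"(^|[.\-])(mail|smtp|mx)\d*[.\-]", re.IGNORECASE)


def looks_like_dul(rdns: str, ip_in_mx: bool) -> bool:
    """Return True when the reverse DNS name looks like dial-up/dynamic space.

    The test is defeated when the sending IP is listed in the domain's MX
    records, since the domain owner has deliberately pointed mail there.
    """
    if ip_in_mx:
        return False
    if MAIL_HOST.search(rdns):
        return False
    return bool(DYNAMIC_RDNS.search(rdns))


def score(hits: set[str]) -> int:
    """Sum the individual weights, then add a bonus for each full combination."""
    total = sum(WEIGHTS.get(name, 0) for name in hits)
    for combo, bonus in COMBO_BONUS.items():
        if combo <= hits:  # every test in the combination fired
            total += bonus
    return total
```

Because the bonus only applies when both tests fire, the individual weights can stay low enough that a single hit alone does not push a message over a blocking threshold.
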
I've gotten serious about pushing a business model for spam blocking in
recent months and word-of-mouth combined with old-fashioned sales has
brought us a good deal of business for a company that hasn't even
launched a site or done any advertising. Our spam blocking percentage
is about 99.7% on our Medium setting (Hold at 13). While that is
definitely much better than the big players and impossible to beat
measurably, I figure that over time the big players will catch up or
come a lot closer. What makes us special, though, is that we have managed
to segregate the blocked messages so that 99% of them land in what we
call Drop (score of 25+) and 1% land in Hold (score of 10 or
13-24), and along with that come other associated capabilities. We are
able to review our Hold file for every one of our customers on a daily
basis because the workload is so small. For instance, yesterday, out of
just over 52,000 blocked messages, only 465 landed in our Hold range
(0.89%). We advise our customers to review this themselves, and because
we don't mix in 100% of the spam for them to review, they are much more
likely to do so. Naturally not all false positives will
land in our Hold range, but I have never seen a personal message land in
our Drop range, and it's generally very gray stuff that lands in Drop
such as some newsletter that uses the services of a company that
primarily engages in spamming (I've only caught this 3 times in Drop,
but it should be more than 99.99% accurate). We try to get all mixed
sources to land in Hold, but sometimes Sniffer helps to push
some over the top, and of course we also make mistakes. Yesterday we
found and reprocessed 9 false positives (personal E-mail and
newsletters) out of 52,000 messages blocked, and we resolved the
conditions that created every one of them so that they would no longer
have issues. Some additional questionable advertising content
was blocked as well, but those things generally require
more research and are not handled immediately, as they are not missed.
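
The Hold/Drop split described above can be sketched in a few lines of Python. This is a hedged illustration, not the production logic: the thresholds come from the figures quoted in this message (Hold at 13 on the Medium setting, Drop at 25+), while the function names, the "deliver" label, and the use of a round 52,000 for "just over 52,000" are assumptions for the example.

```python
HOLD_AT = 13  # Hold threshold on the "Medium" setting quoted above
DROP_AT = 25  # scores of 25 and up land in Drop


def classify(score: int, hold_at: int = HOLD_AT, drop_at: int = DROP_AT) -> str:
    """Map a message score to a hypothetical disposition label."""
    if score >= drop_at:
        return "drop"
    if score >= hold_at:
        return "hold"
    return "deliver"


def hold_share(held: int, blocked: int) -> float:
    """Percentage of blocked messages that landed in the Hold range."""
    return 100.0 * held / blocked


# Approximate figures from the day described above.
print(classify(12), classify(13), classify(24), classify(25))
print(round(hold_share(465, 52_000), 2))
```

Keeping the Hold band narrow is what makes the daily review practical: only the messages scoring between the two thresholds need a human look.
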
Without Sniffer our accuracy would go down and the size of our hold file
would go up, and we would leak more spam, but we would survive and
that's important because we can't become completely dependent on any
single source of data as that represents a liability.
Sniffer has played a major role in our ability to do all of this, but on
its own it's just another tool, albeit one that hits the vast majority
of spam, and it's up to the administrator to make as much as they can of
it. By creating pattern filters and also our own RBL, we are able to
achieve better differentiation between spam and