On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote:

<snip/>

CA> How many times have we all been frustrated by a piece of spam ending
CA> up in *OUR* mailbox that was soooo close in content to spam we whacked
CA> yesterday?

CA> I thought the top "n" obfuscations might be interesting to look at, and
CA> perhaps a shortcut  (temporary, albeit) for spam catching.  I thought we
CA> might see whether, for example, broken URLs, fake comments, or high-bit
CA> ASCII character substitutions were the obfuscation technique du jour.

Here you hit it IMHO. The reality appears to be, from my experience,
that small domains of obfuscation patterns rise and fall like swells
on the ocean. That is, stability tends to arise in one domain of
message characteristics, then fall away and rise in another domain.
Sometimes the domain is well understood and sometimes it is entirely
new and forces us to think differently about what a "feature" really
is.

By domain I mean things like message structure, word obfuscation
techniques, phrase-based swapping, HTML exploitation, etc.

The "du jour" part of your statement is a key element to the problem.
Defining and re-defining the conceptual framework that describes
feature domains in the spam is the other key element.

Put more simply - knowing what to look for is a basic element, but it
gets you nowhere on its own. Knowing (recognizing) when to look for
the "what" is the key that makes the problem workable.
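
A rough way to watch which technique is "du jour" is to tally detector
hits per day over a corpus. This is only a sketch under assumptions: the
three detectors below are crude, hypothetical stand-ins for the techniques
Andrew names (fake comments, high-bit characters, broken URLs), and the
names DETECTORS and top_techniques are made up for illustration. It is not
how Message Sniffer works.

```python
import re
from collections import Counter, defaultdict

# Hypothetical detectors, one per technique. Real rules would be far richer.
DETECTORS = {
    # An HTML comment wedged into the middle of a word, e.g. vi<!--x-->agra
    "fake_comment": re.compile(r"\w<!--.*?-->\w", re.DOTALL),
    # Any character in the U+0080..U+00FF range (a stand-in for "high-bit")
    "high_bit": re.compile(r"[\x80-\xff]"),
    # A URL containing percent-escapes (a crude guess at "broken URLs")
    "broken_url": re.compile(r"https?://\S*%[0-9a-f]{2}", re.IGNORECASE),
}

def top_techniques(messages, n=3):
    """Take (day, body) pairs; return {day: [(technique, hit_count), ...]}.

    Each message counts at most once per technique. Days with no hits
    at all are simply absent from the result.
    """
    per_day = defaultdict(Counter)
    for day, body in messages:
        for name, rx in DETECTORS.items():
            if rx.search(body):
                per_day[day][name] += 1
    return {day: counts.most_common(n) for day, counts in per_day.items()}
```

Comparing the per-day lists over a few weeks would show one technique
fading as another swells, which is the "when to look" half of the problem.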

CA> A while back curiosity got the better of me (it was raining, and
CA> I had a few days off) and I did a few grep sweeps on a warm spam
CA> corpus.

CA> I was disappointed in my success rate for:

CA> v.?i.?a.?g.?r.?a.?

CA> and similar queries with deliberate substitutions (e.g. using a "1"
CA> for "i").  I started writing a grep-generating-permutation engine and
CA> decided my time was better spent on scritching my cat under his chin.

That is a nifty direction that I wish I had more time for. Perhaps I
will some day soon when Sniffer gets slashdotted and sales go through
the roof!

--- meantime, back on this planet, I suggested a very similar thing to
Paul Graham at the first spam conference at MIT. As I recall he said
it was "ambitious" - a description that I have learned has a special
meaning in scientific circles. Something having to do with avian swine
and snowballs that have successful careers as tour guides in hell.

One of these days I think I might do it anyway, just to prove the
point, but in the meantime I too prefer to spend more time with my
cat. ;-)
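
For the curious, the pattern-generating idea can be sketched in a few
lines. This is a minimal illustration, not Andrew's engine and not
anything from Sniffer: the LOOKALIKES table and the obfuscation_pattern
name are hypothetical, and the table is deliberately tiny. It also
narrows the original ".?" between letters to "at most one non-alphanumeric
character", which is an assumption made here to cut down on accidental
matches.

```python
import re

# Hypothetical look-alike table; a real one would be much larger.
LOOKALIKES = {
    "a": "a@4",
    "i": "i1!l|",
    "g": "g9",
    "o": "o0",
    "e": "e3",
}

def obfuscation_pattern(word, max_filler=1):
    """Compile a regex matching `word` with look-alike characters swapped
    in and up to `max_filler` non-alphanumeric characters between letters."""
    filler = r"[^a-z0-9]{0,%d}" % max_filler
    parts = []
    for ch in word.lower():
        # Letters without a table entry only match themselves.
        chars = LOOKALIKES.get(ch, ch)
        parts.append("[" + re.escape(chars) + "]")
    return re.compile(filler.join(parts), re.IGNORECASE)

rx = obfuscation_pattern("viagra")
for sample in ("buy v1@gra now", "V.I.A.G.R.A", "vinegar"):
    print(sample, "->", bool(rx.search(sample)))
```

Even this toy version will false-positive on innocent text such as
"via graph", which is a small taste of why the approach needs more than
good pattern matching.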

Don't get me wrong - I strongly believe it can be done this way, but
it requires much more than good technology. It runs right into one of
the biggest problems with AI and, perhaps more importantly, people's
expectations of AI. No matter how good the pattern learning system
might be, it will always lack the human experience. Computers don't
date or gain weight - so they have a hard time understanding what much
of the spam is "about" simply by looking at the patterns. That's why
the Message Sniffer process is designed with people tightly integrated
into the system.

_M




This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html
