On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote: <snip/>
CA> How many times have we all been frustrated by a piece of spam ending
CA> up in *OUR* mailbox that was soooo close in content to spam we whacked
CA> yesterday?

CA> I thought the top "n" obfuscations might be interesting to look at, and
CA> perhaps a shortcut (temporary, albeit) for spam catching. I thought we
CA> might see whether, for example, broken URLs, fake comments, or high-bit
CA> ASCII character substitutions were the obfuscation technique du jour.

Here you hit it IMHO. The reality appears to be, from my experience,
that small domains of obfuscation patterns rise and fall like swells on
the ocean. That is, stability tends to arise in one domain of message
characteristics and then fall, only to rise in another domain. Sometimes
the domain is well understood and sometimes it is entirely new and
forces us to think differently about what a "feature" really is.

By domain I mean things like message structure, word obfuscation
techniques, phrase-based swapping, HTML exploitation, etc.

The "du jour" part of your statement is a key element of the problem.
Defining and re-defining the conceptual framework that describes feature
domains in the spam is the other key element.

Put more simply - knowing what to look for is a basic element, but it
gets you nowhere on its own. Knowing (recognizing) when to look for the
"what" is the key that makes the problem workable.

CA> A while back curiosity got the better of me (it was raining, and
CA> I had a few days off) and I did a few grep sweeps on a warm spam
CA> corpus.

CA> I was disappointed in my success rate for:

CA> v.?i.?a.?g.?r.?a.?

CA> and similar queries with deliberate substitutions (e.g. using a "1"
CA> for "i"). I started writing a grep-generating-permutation engine and
CA> decided my time was better spent on scritching my cat under his chin.

That is a nifty direction that I wish I had more time for. Perhaps I
will some day soon when Sniffer gets slashdotted and sales go through
the roof!
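
For anyone who wants to play with the idea, a permutation engine of the
kind described above might start from something like the rough sketch
below. It is only an illustration, not how Message Sniffer works: the
SUBSTITUTIONS table, the function name obfuscation_pattern and the
max_gap parameter are all made up for the example, and a real look-alike
table would have to come from an actual spam corpus.

```python
import re

# Hypothetical look-alike table (an assumption for this sketch); a real
# one would be derived from an actual spam corpus.
SUBSTITUTIONS = {
    "a": "a@4",
    "i": "i1!l|",
    "o": "o0",
    "e": "e3",
    "s": "s$5",
}


def obfuscation_pattern(word, max_gap=1):
    """Compile a regex that matches `word` with look-alike characters
    swapped in and up to `max_gap` filler characters between letters."""
    parts = []
    for ch in word.lower():
        # Letters without an entry in the table only match themselves.
        alternatives = SUBSTITUTIONS.get(ch, ch)
        parts.append("[" + re.escape(alternatives) + "]")
    # Filler is limited to punctuation, whitespace and underscores, which
    # is narrower than the ".?" used in the quoted grep query.
    gap = r"[\W_]{0,%d}" % max_gap
    return re.compile(gap.join(parts), re.IGNORECASE)


pattern = obfuscation_pattern("viagra")
for sample in ["buy v1agra now", "V.I.A.G.R.A", "a splash of vinegar"]:
    print(sample, "->", bool(pattern.search(sample)))
```

This catches "v1agra" and "V.I.A.G.R.A" and leaves "vinegar" alone, but
it also fires on the innocent phrase "via graph", since a single space
counts as filler. Knowing which patterns are worth generating, and when,
is still the hard part.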

Meantime, back on this planet, I suggested a very similar thing to Paul
Graham at the first spam conference at MIT. As I recall he said it was
"ambitious" - a description that I have learned has a special meaning in
scientific circles. Something having to do with avian swine and
snowballs that have successful careers as tour guides in hell.

One of these days I think I might do it anyway, just to prove the point,
but in the meantime I too prefer to spend more time with my cat. ;-)

Don't get me wrong - I strongly believe it can be done this way, but it
requires much more than good technology. It runs right into one of the
biggest problems with AI and, perhaps more importantly, people's
expectations of AI. No matter how good the pattern learning system might
be, it will always lack the human experience. Computers don't date or
gain weight - so they have a hard time understanding what much of the
spam is "about" simply by looking at the patterns.

That's why the Message Sniffer process is designed with people tightly
integrated into the system.

_M

This E-Mail came from the Message Sniffer mailing list. For information
and (un)subscription instructions go to
http://www.sortmonster.com/MessageSniffer/Help/Help.html