Hey guys,

I recently wrote a paper (unfortunately, it's in Portuguese) on using the web pages linked in spam messages to help classify them. My methodology was as follows:

1- Download spam messages from the Spam Archive [1], from 07/2010 to 12/2010.
2- Download the web pages linked by those messages. This had to be done on a daily basis, since spam web pages usually have a short lifetime. The computer I was using crashed a few times, but I was still able to download 157,114 web pages (a rough sketch of the crawler is at the end of this message).

3- Download the web pages linked by ham messages. I used the SpamAssassin ham corpora (which is pretty old), but some of the pages were still up.

4- Weed out campaigns. I ran SpamAssassin on all the messages I got the links from, then removed the duplicates: a message was considered a duplicate if it pointed to exactly the same web page as another message and had the same SpamAssassin score and rule output (sketch at the end). I ended up with 12,111 spam messages and 4,927 ham messages. I realize this is a naive way of removing campaigns. Note that I only considered messages containing a URL, for obvious reasons (I can't use web pages to classify messages that don't have one).

5- Classify the web pages using an associative classifier [2], with 5-fold cross-validation [3] to avoid overfitting. The classifier outputs a predicted class (spam or ham) and a confidence (as a percentage). I multiplied the confidence by some weight to get the final score for the page (negative if the page was classified as ham), and added that score to the SpamAssassin score of the message (sketch at the end). I used SpamAssassin with default parameters, except that I disabled Bayes.

I attached an image showing the results for different weights. The X axis is the weight I multiplied by the classifier's confidence, and the Y axis is the error rate. The blue line on the bottom is SpamAssassin's false positive rate and the purple line on the top is SpamAssassin's false negative rate; the red line is the false positive rate using SpamAssassin + web pages, and the green line is the corresponding false negative rate. With a weight of 3 for the page score, I get the same false negative rate and classify about 96% of the messages correctly, while SpamAssassin alone classifies about 87%.

I am aware of the WebRedirect plugin [4], but it was last updated in 2006. Is it too expensive to query for web pages? Does the cost make this approach useless? I was initially thinking of implementing this in SpamAssassin as a Google Summer of Code project, but it is such a basic task that (if it's usable) I could probably do it in no time. The classifier I used outputs readable rules, so it would be straightforward to translate them into regular expressions (sketch at the end). And it seems spammers don't even bother to obfuscate the web pages (or maybe they don't even have control over them). For example, 36.7% of the web pages I downloaded contained the word "viagra", and 99.84% of those came from spam messages (the remaining 0.16% probably did as well, most likely due to some minor error).

What do you guys think? Is it worth trying? Any ideas?

Thanks in advance,
Marco TĂșlio Correia Ribeiro

References:
[1] - http://untroubled.org/spam/
[2] - http://www.dcc.ufmg.br/~adrianov/papers/ICDM06/Veloso-icdm06.pdf
[3] - http://en.wikipedia.org/wiki/Cross-validation_(statistics)
[4] - http://wiki.apache.org/spamassassin/WebRedirectPlugin
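P.S. Below are the rough sketches I refer to above, in Python. None of this is existing SpamAssassin code; the names, file layouts and parameters are just my own illustrations of what I have in mind.

First, the daily crawl from steps 2 and 3. The URL regex and the way pages are stored are assumptions on my part:

import hashlib
import pathlib
import re
import urllib.request

URL_RE = re.compile(r'https?://[^\s"<>]+', re.IGNORECASE)
OUT_DIR = pathlib.Path("pages")

def extract_urls(message_text):
    # Pull anything that looks like an http(s) link out of the raw message.
    return URL_RE.findall(message_text)

def fetch(url, timeout=10):
    # Spam pages die quickly, so failed downloads are expected and skipped.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None

def crawl(message_files):
    OUT_DIR.mkdir(exist_ok=True)
    for path in message_files:
        text = pathlib.Path(path).read_text(errors="ignore")
        for url in extract_urls(text):
            page = fetch(url)
            if page is not None:
                # Name the file after the URL hash so repeated daily runs
                # don't store the same page twice.
                name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
                (OUT_DIR / name).write_bytes(page)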
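Second, the duplicate removal from step 4. I'm assuming the SpamAssassin score and the list of rule names that fired have already been parsed out of SpamAssassin's output for each message:

def deduplicate(messages):
    # messages: list of dicts with 'url', 'sa_score' and 'sa_rules' keys,
    # where 'sa_rules' is the list of rule names SpamAssassin reported.
    seen = set()
    unique = []
    for msg in messages:
        key = (msg["url"], msg["sa_score"], tuple(sorted(msg["sa_rules"])))
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique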
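Third, the score combination from step 5. Here I treat the classifier's confidence as a fraction in [0, 1] rather than a percentage, use 5.0 as the spam threshold (SpamAssassin's default required_score), and take weight = 3 from the plot:

SPAM_THRESHOLD = 5.0  # SpamAssassin's default required_score

def page_score(predicted_class, confidence, weight=3.0):
    # Positive contribution when the page looks like spam, negative when ham.
    sign = 1.0 if predicted_class == "spam" else -1.0
    return sign * weight * confidence

def is_spam(sa_score, predicted_class, confidence, weight=3.0):
    return sa_score + page_score(predicted_class, confidence, weight) >= SPAM_THRESHOLD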
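Finally, the rule-to-regular-expression translation I mentioned. The rule format here (a term, a predicted class and a confidence) is only my guess at what the associative classifier emits, and picking the single strongest matching rule is a simplification of how such classifiers usually combine rules:

import re

def compile_rules(rules):
    # rules: iterable of (term, predicted_class, confidence) tuples,
    # e.g. ("viagra", "spam", 0.9984).
    return [(re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE), cls, conf)
            for term, cls, conf in rules]

def classify_page(page_text, compiled_rules):
    # Return (class, confidence) of the strongest matching rule, or None.
    best = None
    for pattern, cls, conf in compiled_rules:
        if pattern.search(page_text) and (best is None or conf > best[1]):
            best = (cls, conf)
    return best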
