Hey guys,
I recently wrote a paper (unfortunately, it's in Portuguese) on using
the web pages linked in spam messages to help classify them. My
methodology was as follows:
1- Download spam messages from the Spam Archive [1], covering 07/2010
to 12/2010.

2- Download the web pages linked by those messages. This had to be
done on a daily basis, since spam web pages usually have a short
lifetime. The computer I was using crashed a few times, but I was
still able to download 157,114 webpages.

3- Download the web pages linked by ham messages. I used the
SpamAssassin ham corpus (which is pretty old), but some of the pages
were still up.

4- I tried weeding out campaigns, so I ran SpamAssassin on all the
messages I got the links from and then removed the duplicates (a
message was considered a duplicate if it pointed to the exact same
webpage as another message and had the same SpamAssassin score and
rule output; see the first sketch after this list). I ended up with
12,111 spam messages and 4,927 ham messages. I realize this is a naive
way of removing campaigns. It is worth noting that I only considered
messages that contained a URL, for obvious reasons (I can't use
webpages to classify messages that don't have one).

5- Next, I classified the webpages using an associative classifier
[2]. I did 5-fold cross-validation [3] to avoid overfitting. The
classifier outputs a predicted class (spam or ham) and a confidence
(as a percentage). I then multiplied that confidence by some weight,
which gave me the final score for the page (negative if the page was
classified as ham), and added this score to the SpamAssassin score of
the message (see the second sketch after this list). I used
SpamAssassin with default parameters, except that I disabled bayes.
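
To make step 4 concrete, here is a minimal sketch in Python of the
duplicate removal (the naming is mine, not the actual scripts; the
dedup key is simply URL + SpamAssassin score + rules fired):

def remove_campaign_duplicates(messages):
    """messages: iterable of dicts with 'url', 'sa_score', 'sa_rules'."""
    seen = set()
    unique = []
    for msg in messages:
        # Two messages with the same URL, score and rule hits are
        # treated as copies of the same campaign; keep only the first.
        key = (msg['url'], msg['sa_score'], tuple(sorted(msg['sa_rules'])))
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique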
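
And this is roughly how the page score is combined with the
SpamAssassin score in step 5. I am treating the confidence as a
fraction in [0, 1] here, and the weight is the value swept along the
X axis of the attached plot:

def combined_score(sa_score, predicted_class, confidence, weight=3.0):
    """Add the signed web page score to the SpamAssassin score.

    sa_score        -- SpamAssassin score of the message (bayes disabled)
    predicted_class -- 'spam' or 'ham', from the associative classifier
    confidence      -- classifier confidence, as a fraction in [0, 1]
    weight          -- the weight shown on the X axis of the plot
    """
    sign = 1.0 if predicted_class == 'spam' else -1.0
    return sa_score + sign * confidence * weight

# e.g. a message with SA score 4.2 (below the default 5.0 threshold)
# whose linked page is classified as spam with 90% confidence ends up
# at 4.2 + 3.0 * 0.9 = 6.9, i.e. over the threshold.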

I attached an image that shows the results for different weights
given to the page score. The X axis is the weight I multiplied by the
confidence of the classifier, and the Y axis is the error rate. The
blue line on the bottom is SpamAssassin's false positive rate, while
the purple line on the top is SpamAssassin's false negative rate. The
red line is the false positive rate using SpamAssassin + webpages, and
the green line is the false negative rate. For a page weight of 3, I
get the same false negative rate as stock SpamAssassin and classify
about 96% of the messages correctly, while SpamAssassin alone
classifies about 87% correctly.

I am aware of the WebRedirect plugin [4], but it was last updated in
2006. Is it too expensive to fetch the linked webpages at scan time?
Does that cost make this approach useless?

I was initially thinking of implementing this in SpamAssassin as a
Google Summer of Code project, but it is such a basic task that (if
it's usable) I could probably do it in no time. The classifier I used
outputs readable rules, so it would be a piece of cake to translate
them into regular expressions (see the toy example below). And it
seems spammers don't even bother trying to obfuscate the web pages (or
maybe they don't even have control over them): for example, 36.7% of
the webpages I downloaded contained the word viagra, and 99.84% of
those were spam (the remaining 0.16% probably were as well, most
likely due to some minor error). What do you guys think? Is it worth
trying? Any ideas?
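
To illustrate what I mean by translating the rules into regular
expressions, here is a toy example (the words and weights below are
made up for illustration, not the actual classifier output):

import re

# Hypothetical (pattern, weight) pairs; the real ones would be
# generated from the readable rules the associative classifier outputs.
PAGE_RULES = [
    (re.compile(r'\bviagra\b', re.IGNORECASE), 2.5),
    (re.compile(r'\breplica\s+watches\b', re.IGNORECASE), 1.5),
]

def page_rule_score(page_text):
    """Sum the weights of the rules whose pattern matches the page."""
    return sum(w for pattern, w in PAGE_RULES if pattern.search(page_text))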

Thanks in advance,
Marco TĂșlio Correia Ribeiro

References:
[1] - http://untroubled.org/spam/
[2] - http://www.dcc.ufmg.br/~adrianov/papers/ICDM06/Veloso-icdm06.pdf
[3] - http://en.wikipedia.org/wiki/Cross-validation_(statistics)
[4] - http://wiki.apache.org/spamassassin/WebRedirectPlugin
