Re: RFC: New subproject, BlogSpamAssassin

Matthew Mullenweg 22 Dec 2004 20:57:11 -0000

Henry Stern wrote:

Interesting plugin.  However, I'm a bit skeptical of how well
content-based filtering will work for blog spam.  The main difference
between e-mail spam and weblog spam is that e-mail spam is intended to
be read by a person, whereas blog spam is intended to be read by a
search engine's spider.

My experience has been that content filtering can be very effective, because no one wants to be first on Google for "v1agra". The obfuscation techniques spammers can use are therefore somewhat limited. WP had virtually no spam until they found a bug in older versions where they could use lower numeric entities (like &#101; for "e") to get past the *very* basic moderation filters we had in place and still be read correctly by Google. The WordPress plugin community has been very active in addressing this problem, so let me take a moment to point out some of the tools currently out there:

===
http://elliottback.com/wp/archives/2004/11/29/spam-stopgap-extreme/
http://dev.wp-plugins.org/browser/wp-hashcash/trunk/

This is a JS proof-of-work implementation that has been extremely (100%) effective in blocking non-human spam thus far. It is the only technique of this type that has worked for more than about a week; other modifications, such as adding random fields, asking questions in the comment form, and changing the URI of the comment post script, have been bypassed by the bots within a few days.
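To make the proof-of-work idea concrete, here is a minimal hashcash-style sketch in Python. It is not the wp-hashcash code (that plugin is JavaScript, and its actual scheme isn't described here); the challenge format, the use of SHA-256, and the difficulty value are all assumptions for illustration. The point is only that the commenter's browser has to burn some CPU to find a nonce, while the server checks the result with a single hash.

```python
import hashlib
import itertools
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # 8 - bit_length() is the number of zero bits before the first set bit.
        bits += 8 - byte.bit_length()
        break
    return bits


def new_challenge() -> str:
    """Server side: a random token embedded in the comment form (hypothetical format)."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash has enough leading zero bits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the submitted work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

With a difficulty of 12 bits the client needs a few thousand hashes on average, which is negligible for one human comment but adds up for a bot posting to thousands of weblogs.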

Things along this line will not be effective in the long run, because there is a commenting protocol popularized by Six Apart that is designed specifically for no human involvement: TrackBack. This is an essential feature to many bloggers.

http://www.movabletype.org/trackback/

Pingback is more robust and requires a link back, but can still be spoofed:

http://www.hixie.ch/specs/pingback/pingback

The approach we're taking to that is whitelisting the URIs in the WP-integrated blogroll and moderating the others. We also don't allow any markup within these comments.
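As a rough sketch of that whitelist-or-moderate policy (in Python rather than WordPress's PHP, and not the actual WP code): the function names are hypothetical, and matching on host name rather than on the full URI is my assumption about what "in the blogroll" would mean in practice.

```python
import html
import re
from typing import Iterable, Tuple
from urllib.parse import urlsplit

TAG_RE = re.compile(r"<[^>]*>")


def host_of(uri: str) -> str:
    """Lower-cased host name with a leading 'www.' removed."""
    host = (urlsplit(uri).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def strip_markup(text: str) -> str:
    """Drop anything that looks like a tag, then escape what is left."""
    return html.escape(TAG_RE.sub("", text))


def handle_ping(source_uri: str, excerpt: str, blogroll: Iterable[str]) -> Tuple[str, str]:
    """Approve pings from blogroll hosts; hold everything else for moderation."""
    trusted_hosts = {host_of(uri) for uri in blogroll}
    status = "approved" if host_of(source_uri) in trusted_hosts else "moderated"
    return status, strip_markup(excerpt)
```

A ping from a blogroll host goes straight through with its markup removed; anything else lands in the moderation queue, also without markup.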

===
http://wordpress.org/development/2004/12/fight-spam/
http://mookitty.co.uk/devblog/category/kittens-spaminator/
http://www.unknowngenius.com/blog/static/spam-karma

These are the two plugins that combined about a dozen different efforts that were going on. Both have a scoring system very much like SpamAssassin's in some ways, using content characteristics, RBL lookups, user agent characteristics (how long it was on the page before posting, whether it is coming through a proxy) and contextual characteristics like the age of the post. Spaminator has a "tar pit" which tries to slow down bots, once one has been identified, by inserting random delays before responses. This seems to have pissed them off, because several of the bots now check for the Spaminator files before targeting a weblog. Spam Karma is interesting because if your comment is borderline spam (right on the threshold) you can get it through by filling out an image CAPTCHA or responding to an email confirmation, so it combines CAPTCHA with an accessible alternative.
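A scoring scheme of that general shape might look like the following Python sketch. None of this is taken from Spaminator or Spam Karma: the field names, weights, threshold and margin are all made up for illustration. It only shows how independent signals add up to a score, and how a band around the threshold can trigger a challenge (CAPTCHA or email confirmation) instead of an outright rejection.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    link_count: int         # content characteristic: number of links in the body
    seconds_on_page: float  # time between loading the form and posting
    via_proxy: bool         # request appears to come through a proxy
    on_blocklist: bool      # source IP is listed on an RBL
    post_age_days: int      # context: age of the entry being commented on


def score(c: Comment) -> float:
    """Sum hypothetical weights for each signal that fires."""
    s = 0.0
    if c.on_blocklist:
        s += 3.0
    if c.via_proxy:
        s += 1.0
    if c.seconds_on_page < 5:
        s += 2.0
    if c.post_age_days > 30:
        s += 1.0
    s += 0.5 * min(c.link_count, 6)  # cap the link contribution
    return s


def classify(c: Comment, threshold: float = 5.0, margin: float = 1.0) -> str:
    """Return 'spam', 'challenge' (borderline) or 'ham'."""
    s = score(c)
    if s >= threshold + margin:
        return "spam"
    if s >= threshold - margin:
        return "challenge"
    return "ham"
```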

===
Others

I've seen some interesting talk of centralized/decentralized systems, which operate much like razor or pyzor except the server is freely available and easy to install as an add-on to WordPress. Submissions can come from trusted sources with keys and then a web of trust can be extended out by utilizing XFN metadata that WordPress supports in its blogrolls.

http://gmpg.org/xfn/

This could be very interesting, as it would be hard to target in a central fashion (there can be hundreds or thousands of "servers") and it doesn't require much manual intervention by the person running the plugin; only the person running the server has to be proactive. It could also scale well. However, the code for this isn't ready for release yet; it's undergoing a security audit and review.

===
Tool level

On the core WordPress level I've been focused on bugs that could allow bypassing the content filters (like the numeric entity thing) and making the attack surface as small as possible. WP has a nice moderation system where you can say a comment needs to be approved manually before it will show up on the site, so enabling this automatically for old or inactive discussions is a great way to make the "open targets" fewer and still not kill conversation on older entries. (Most bloggers *love* comments and the thought of missing some is painful.)
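Both core-level measures can be sketched briefly in Python (again, not the WordPress code, which is PHP). The blocked word, the function names and the age and idle limits are placeholders I've picked for illustration. The first part decodes numeric entities before matching, so that an entity-encoded letter can't slip past the filter; the second decides when a discussion is old or quiet enough to switch to manual approval.

```python
import html
from datetime import datetime, timedelta

BLOCKED_WORDS = {"casino"}  # hypothetical filter list


def normalize(text: str) -> str:
    """Decode character references such as &#101; and lower-case the result."""
    return html.unescape(text).lower()


def hits_filter(text: str) -> bool:
    """Match blocked words against the decoded text, not the raw markup."""
    plain = normalize(text)
    return any(word in plain for word in BLOCKED_WORDS)


def needs_moderation(
    post_date: datetime,
    last_comment: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=14),  # hypothetical limit
    max_idle: timedelta = timedelta(days=7),  # hypothetical limit
) -> bool:
    """Hold comments for approval on old or inactive discussions."""
    return (now - post_date) > max_age or (now - last_comment) > max_idle
```

Comments on a fresh, active entry still appear immediately, while an entry from months ago stays open but sends everything through the approval queue.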

So, I hope that's a helpful overview to get the conversation started.
--
Matt Mullenweg
 http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com
