[Mimedefang] HTMLCheck

Joseph Brennan Sat, 18 Mar 2006 07:11:30 -0800


The recent talk about anchors with "https" text and "http" links
has nudged me to post the code I've been working on that uses
HTML::TokeParser to look for bad things in HTML.  See the wiki
<http://www.mimedefang.com/kwiki/index.cgi?FilterExamples>
under HTMLCheck.


So far I have not seen HTML::TokeParser confused by any of the
obfuscation tricks used by spammers.  It takes care of identifying
all the tags and their attributes, and categorizes them for us as
start, end, text, etc.  All I had to code was what to look for and
what to do with it.
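
To illustrate that division of labor, here is a minimal sketch of a
get_token loop.  It is not the HTMLCheck code from the wiki; the helper
name summarize_tokens and the summary strings are made up for this
example.  The token layouts in the comments are the ones documented
for HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTMLCheck): return one summary
# string per token that HTML::TokeParser reports.
sub summarize_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    my @summary;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start tag: ['S', $tag, \%attr, \@attrseq, $original_text]
            my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
            my @pairs = map { "$_=$attr->{$_}" } @$attrseq;
            push @summary, join(' ', 'start', $tag, @pairs);
        }
        elsif ($type eq 'E') {
            # End tag: ['E', $tag, $original_text]
            push @summary, "end $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text: ['T', $text, $is_data]
            push @summary, "text $token->[1]";
        }
        else {
            # Comments ('C'), declarations ('D'), processing instructions ('PI')
            push @summary, "other $type";
        }
    }
    return @summary;
}

print "$_\n" for summarize_tokens('<A HREF="http://foo.com">http://bar.com</A>');
```

Tag and attribute names come back lowercased, so the checks that
follow do not need to worry about mixed-case markup.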

HTMLCheck actually changes messages, by commenting out some of
the bad things with <!-- tags -->.  For other bad things it asks
for the message to be rejected.
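
The commenting-out step could be sketched roughly as follows.  This is
an illustration, not the HTMLCheck code: the %unwanted list and the
name comment_out_tags are hypothetical, and the sketch does not guard
against tags whose text already contains "--".

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical list of unwanted tags, chosen only for this illustration.
my %unwanted = map { $_ => 1 } qw(form);

# Rebuild the HTML from the original token text, wrapping unwanted
# start and end tags in comment markers.
sub comment_out_tags {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    my $out = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        # The original source text is element 1 of a text token and
        # the last element of every other token type.
        my $text = $type eq 'T' ? $token->[1] : $token->[-1];
        if (($type eq 'S' || $type eq 'E') && $unwanted{ $token->[1] }) {
            $text = "<!-- $text -->";
        }
        $out .= $text;
    }
    return $out;
}

print comment_out_tags('<p>hi<form action="http://foo.com">x</form></p>'), "\n";
```

Because everything else is copied through from the original token
text, the rest of the message body is left as it was.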

Among the bad things are those mismatched anchors.  We compare
the domains of the visible and real URLs.  If they do not
match, we comment out the anchor tag, leaving the visible URL
to be copy-pasted if wanted.  If in addition the visible URL is
https and the real one is http, we reject the message.  Thus...

<a href="http://foo.com">http://bar.com</a> becomes
<!-- <a href="http://foo.com"> -->http://bar.com<!-- </a> -->

<a href="http://foo.com">https://foo.com</a> becomes
<!-- <a href="http://foo.com"> -->https://foo.com<!-- </a> -->

<a href="http://foo.com">https://bar.com</a> is rejected.
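
As a rough sketch of the decision rule as stated in the prose (visible
text https, real link http), assuming that "domain" means the exact
host name and that only anchors whose visible text is itself a URL are
judged: the names scheme_and_host and classify_anchor are hypothetical
and not taken from HTMLCheck.  Treating a same-domain https-to-http
anchor as commented out is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: split "scheme://host/..." into a lowercase
# scheme and host.  Returns an empty list if the string is not an
# http or https URL.
sub scheme_and_host {
    my ($url) = @_;
    return unless defined $url
        && $url =~ m{^\s*(https?)://([^/:?#\s]+)}i;
    return (lc($1), lc($2));
}

# Returns 'reject', 'comment', or 'ok' for one anchor, given the
# href value and the visible text.
sub classify_anchor {
    my ($href, $visible) = @_;
    my ($real_scheme, $real_host) = scheme_and_host($href);
    my ($seen_scheme, $seen_host) = scheme_and_host($visible);

    # Only judge anchors whose visible text is itself a URL.
    return 'ok' unless defined $real_host && defined $seen_host;

    # True when the reader sees https but the link goes to http.
    my $downgrade = $seen_scheme eq 'https' && $real_scheme eq 'http';

    if ($real_host ne $seen_host) {
        return $downgrade ? 'reject' : 'comment';
    }
    # Assumption: same domain, https shown but http linked, is commented out.
    return $downgrade ? 'comment' : 'ok';
}

print classify_anchor('http://foo.com', 'https://bar.com'), "\n";
```

A real filter would probably want a looser comparison, for example
treating www.foo.com and foo.com as the same domain.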


Joseph Brennan
Columbia University Information Technology








