The recent talk about anchors with "https" text and "http" links has nudged me to post the code I've been working on that uses HTML::TokeParser to look for bad things in HTML. See the wiki <http://www.mimedefang.com/kwiki/index.cgi?FilterExamples> under HTMLCheck.
So far I have not seen HTML::TokeParser confused by any of the obfuscation tricks spammers use. It identifies all the tags and their attributes, and categorizes each token for us as start, end, text, etc. All I had to code was what to look for and what to do with it.

HTMLCheck actually changes messages, by commenting out some of the bad things with <!-- tags -->. For other bad things it asks for the message to be rejected.

Among the bad things are those mismatched anchors. We compare the domains of the visible and real URLs. If they do not match, we comment out the anchor tag, leaving the visible URL to be copied and pasted if wanted. If the visible is https and the real is http and the domains do not match, reject.

Thus...

    <a href="http://foo.com">http://bar.com</a>
becomes
    <!-- <a href="http://foo.com"> -->http://bar.com<!-- </a> -->

    <a href="https://foo.com">http://foo.com</a>
becomes
    <!-- <a href="https://foo.com"> -->http://foo.com<!-- </a> -->

    <a href="https://foo.com">http://bar.com</a>
is rejected.

Joseph Brennan
Columbia University Information Technology
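For anyone who wants the gist without going to the wiki, the anchor check described above might be sketched roughly as follows. This is not the HTMLCheck code itself, only a minimal illustration under some assumptions: the names scheme_and_host, classify_anchor and check_html are made up for this sketch, full hostnames are compared rather than registered domains, and a scheme mismatch is treated the same way in either direction so that the three examples above come out as shown. Rejection is only reported through a flag; what the filter then does with the message is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Hypothetical helper: split a URL into (scheme, host).
# Returns an empty list when the string does not start with http:// or https://.
sub scheme_and_host {
    my ($url) = @_;
    return unless defined $url;
    return unless $url =~ m{^\s*(https?)://([^/:?#\s]+)}i;
    return (lc $1, lc $2);
}

# Hypothetical helper: returns 'ok', 'comment' or 'reject' for one anchor.
# Assumption: hostnames are compared as-is (no reduction to a base domain).
sub classify_anchor {
    my ($href, $visible) = @_;
    my ($real_scheme, $real_host) = scheme_and_host($href);
    my ($vis_scheme,  $vis_host)  = scheme_and_host($visible);

    # Nothing to compare unless both the link and the visible text are URLs.
    return 'ok' unless defined $real_host && defined $vis_host;

    my $host_differs   = $real_host   ne $vis_host;
    my $scheme_differs = $real_scheme ne $vis_scheme;

    return 'reject'  if $host_differs && $scheme_differs;
    return 'comment' if $host_differs || $scheme_differs;
    return 'ok';
}

# Walk the tokens, comment out suspicious anchors, and return
# (rewritten HTML, flag that is true if any anchor called for rejection).
sub check_html {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
    my ($out, $reject) = ('', 0);
    my $anchor;    # state for the <a href=...> we are currently inside, if any

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        # Original source text: second element for text tokens, last otherwise.
        my $raw = $type eq 'T' ? $t->[1] : $t->[-1];

        if (!$anchor && $type eq 'S' && $t->[1] eq 'a' && defined $t->[2]{href}) {
            $anchor = { start => $raw, href => $t->[2]{href}, inner => '', text => '' };
        }
        elsif ($anchor && $type eq 'E' && $t->[1] eq 'a') {
            my $verdict = classify_anchor($anchor->{href}, $anchor->{text});
            $reject = 1 if $verdict eq 'reject';
            if ($verdict eq 'ok') {
                $out .= $anchor->{start} . $anchor->{inner} . $raw;
            }
            else {
                # Keep the visible text, disable the tags around it.
                $out .= '<!-- ' . $anchor->{start} . ' -->'
                      . $anchor->{inner}
                      . '<!-- ' . $raw . ' -->';
            }
            undef $anchor;
        }
        elsif ($anchor) {
            $anchor->{inner} .= $raw;
            $anchor->{text}  .= $raw if $type eq 'T';
        }
        else {
            $out .= $raw;
        }
    }
    # An <a> that was never closed is passed through unchanged.
    $out .= $anchor->{start} . $anchor->{inner} if $anchor;
    return ($out, $reject);
}

# Run the three example anchors through the sketch.
for my $sample (
    '<a href="http://foo.com">http://bar.com</a>',
    '<a href="https://foo.com">http://foo.com</a>',
    '<a href="https://foo.com">http://bar.com</a>',
) {
    my ($fixed, $flag) = check_html($sample);
    print $flag ? "reject: $sample\n" : "$fixed\n";
}
```

The sketch leans on the parser for everything structural, which is the point made above: tag and attribute names arrive lowercased and attribute values already unquoted, so the only code needed is the comparison and the decision. A real filter would also want to handle things like a "--" inside a tag being commented out, or a user@host form in the URL, which this sketch ignores.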

