It appears to me that HTML::Parser (and, by inheritance,
HTML::TokeParser and HTML::TreeBuilder) treats invalid elements as
text. While this is valid, and is in fact the recommended behavior,
browsers in actual practice ignore invalid elements. I've been unable
to find a setting that makes the parsers do the same. Is there one I'm
missing, should I look into adding something, or is there a reason
this should not be done?
The specific reason I ask is to improve the operation of SpamCop.
Some spammers are putting invalid elements like:
<!a
href="http://3274458682/sang.yong/index.html"}My_Site{/a}{/font}{/center}{/t
d>
into HTML spam. Most browsers don't display this, but since
HTML::Parser returns it as text, SpamCop picks up that URL as being
spamvertised. (SpamCop has to scan the text of the message for URLs,
because some spammers place URLs in the plain text and ask the victim
to cut and paste them.)
--
#include <disclaimer.h> /* Sten Drescher */
"This is the *NIX version of the 'ILOVEYOU' worm. It runs on the honor
system. Forward this to everyone in your address book, and randomly delete
some of your files." - Unknown