Sten <[EMAIL PROTECTED]> writes:
> It appears to me that HTML::Parser, and, by inheritance,
> HTML::TokeParser and HTML::TreeBuilder, treat invalid elements as
> text. While this is valid, and is in fact the recommended behavior,
> the actual practice of browsers is to ignore invalid elements. I've
> been unable to find a setting to make the parsers do the same. Is
> there one I'm missing, should I look into adding something, or is
> there a reason this should not be done?
I don't think you are missing anything.
> The specific reason for asking is to enhance the operation of
> SpamCop. Some spammers are putting invalid elements like:
>
> <!a href="http://3274458682/sang.yong/index.html"}My_Site{/a}{/font}{/center}{/td>
One principle we have to follow is that HTML-Parser can't simply
ignore such input. It has to generate some kind of event, and you say
that event should not be 'text'. What should it be?
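For reference, here is a minimal sketch (using the api_version 3
handler interface) that just prints every event the parser generates
for a fragment like the one quoted above, so it is easy to see how the
<!a ...> part is currently classified:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use HTML::Parser ();

  # the spam fragment quoted above, trimmed down a bit
  my $html = '<center><!a href="http://3274458682/sang.yong/index.html"}My_Site{/a}</center>';

  my $p = HTML::Parser->new(
      api_version => 3,
      handlers    => {
          # one line per event, so the classification of <!a ...> is obvious
          start       => [ sub { print "start:       $_[0]\n" }, "text" ],
          end         => [ sub { print "end:         $_[0]\n" }, "text" ],
          text        => [ sub { print "text:        $_[0]\n" }, "text" ],
          comment     => [ sub { print "comment:     $_[0]\n" }, "text" ],
          declaration => [ sub { print "declaration: $_[0]\n" }, "text" ],
      },
  );
  $p->parse($html);
  $p->eof;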
A rule that might work is that <!...> stuff that does not parse as a
valid declaration, marked section, or comment is returned as a comment
anyway. That means changing the code in parse_decl() at the FAIL:
label to scan for a ">" and return a 'comment'. Do you want to try to
provide a patch?
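Until such a patch exists, something like the following regexp hack
approximates that "scan for the next >" rule at the application level,
by turning unrecognized <!...> constructs into comments before the
document is handed to the parser. It is only a sketch: it knows
nothing about quoted ">" characters (see the Netscape observation
below).

  # Rewrite <!...> constructs that are not comments, marked sections,
  # DOCTYPE or ENTITY declarations into comments, scanning only up to
  # the first ">" (the naive rule described above).
  sub neutralize_junk_declarations {
      my ($html) = @_;
      $html =~ s{
          <!                                    # declaration open
          (?! -- | \[ | DOCTYPE\b | ENTITY\b )  # not a recognized form
          ([^>]*)                               # up to the first ">"
          >
      }{'<!-- ' . _safe($1) . ' -->'}gexis;
      return $html;
  }

  # keep "--" out of the generated comment so it stays well formed
  sub _safe { (my $s = shift) =~ s/--+/-/g; return $s }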
Another possibility is to introduce a new event type for this instead
of classifying it as a comment. For instance 'declaration-junk', or
perhaps the existing 'declaration' event is good enough (with a single
token).
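Just to make the difference concrete: with the 'declaration' route an
existing handler like the one below would start receiving these tokens
alongside real <!DOCTYPE ...> declarations, while a separate name such
as 'declaration-junk' (which does not exist today) would let
applications opt in to them explicitly:

  my $p = HTML::Parser->new(
      api_version => 3,
      handlers    => {
          # existing event, currently fired for <!DOCTYPE ...> and the
          # like; under this idea it would also see the junk tokens
          declaration => [ sub { print "declaration: $_[0]\n" }, "text" ],

          # hypothetical event name from the paragraph above -- it does
          # NOT exist in HTML::Parser today
          #'declaration-junk' => [ sub { print "junk: $_[0]\n" }, "text" ],
      },
  );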
But perhaps it's not as simple as that after all? If I let Netscape
display:

  foo <! ignore "ba>" ignore > bar

it ends up showing "foo bar", so it actually ignored the ">" inside
the quotes. Is there anybody who can actually tell me what rules
"major browsers" use for skipping junky <!...> stuff? Has somebody
here read the Mozilla code??
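If that observation holds in general, the rule is presumably an
end-of-declaration scan that skips over quoted strings. A rough guess
at such a rule, just as a sketch and not anything taken from actual
browser source:

  # an unquoted ">" ends the construct; ">" inside single or double
  # quotes does not
  my $junk_decl = qr/<! (?: [^>"'] | "[^"]*" | '[^']*' )* >/x;

  my $doc = q(foo <! ignore "ba>" ignore > bar);
  (my $stripped = $doc) =~ s/$junk_decl//g;
  print "$stripped\n";   # "foo  bar", which renders as "foo bar"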
I also wonder how all this interacts with the earlier bug report that
led to the following change:
| Declaration parsing mode now only triggers for <!DOCTYPE ...> and
| <!ENTITY ...>. Based on patch by la mouton <[EMAIL PROTECTED]>.
Regards,
Gisle