Sten <[EMAIL PROTECTED]> writes:

>       It appears to me that HTML::Parse, and, by inheritance,
> HTML::TokeParser and HTML::TreeBuilder, treats invalid elements as
> text.  While this is valid, and is in fact the recommended behavior,
> actual practice of browsers is to ignore the invalid elements.  I've
> been unable to find a setting to make the parsers do the same.  Is
> there one I'm missing, should I look into adding something, or is
> there a reason that this should not be done?

I don't think you are missing anything.
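
For the record, this is roughly how the current behaviour shows up at
the HTML::TokeParser level.  Untested sketch (and the URL is just a
stand-in), but I would expect the junk construct to come back as one or
more "T" (text) tokens, which is what you describe:

   use HTML::TokeParser;

   # A small document containing one of the junk "<!...>" constructs.
   my $doc = 'foo <!a href="http://example.com/"}X{/a} bar';
   my $p = HTML::TokeParser->new(\$doc);

   while (my $token = $p->get_token) {
       # Each token is an array ref; the first element is the type:
       # "S" start tag, "E" end tag, "T" text, "C" comment, "D" declaration.
       printf "%s  %s\n", $token->[0], $token->[1];
   }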

>       The specific reason for asking is for enhancing the operation
> of SpamCop.  Some spammers are putting invalid elements like:
> 
> <!a href="http://3274458682/sang.yong/index.html"}My_Site{/a}{/font}{/center}{/td>

One principle we have to follow is that HTML-Parser can't simply
ignore it.  It has to generate some kind of event, and you say it
should not be 'text'.  What should it be?

A rule that might work is that <!...> stuff that does not parse as a
valid declaration, marked section, or comment is returned as a comment
anyway.  That would mean changing the code in parse_decl() at FAIL: to
scan for a ">" and return a 'comment'.  Do you want to try to provide a
patch?
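
Whoever wants to play with that might find a little test harness handy.
Untested, but something along these lines (v3 callback API) should show
which event the junk currently ends up as, and whether a patched
parse_decl() turns it into a 'comment':

   use HTML::Parser;

   my $junk = '<!a href="http://3274458682/sang.yong/index.html"}My_Site{/a}';

   my $p = HTML::Parser->new(
       api_version => 3,
       # Dump every event together with its type, so it is easy to see
       # whether the junk comes back as 'text' or as 'comment'.
       default_h => [ sub { printf "%-12s %s\n", @_ }, 'event, text' ],
   );
   $p->parse("foo $junk bar");
   $p->eof;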

Another possibility is to introduce a new event type for this instead
of classifying it as a comment.  For instance 'declaration-junk', or
perhaps 'declaration' is good enough (with a single token).

But perhaps it's not as simple as that after all.  If I let Netscape
display:

   foo <! ignore "ba>" ignore > bar

it ends up displaying "foo bar", so it actually ignores the ">" inside
the quotes.  Can anybody tell me what rules the "major browsers" use
for skipping junky <!...> stuff?  Is there somebody here who has read
the Mozilla code?
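
If the rule really is "scan forward to the first '>' that is not inside
quotes", then the FAIL: path would need a quote-aware scan.  Just to
show the idea (the real thing has to live in the C code and deal with
chunked input):

   # Given the text following "<!", return the offset just past the '>'
   # that terminates the junk, treating '>' inside single or double
   # quotes as ordinary data.  Returns undef if no terminating '>' yet.
   sub junk_end_offset {
       my $text = shift;
       while ($text =~ /\G(?: "[^"]*" | '[^']*' | [^>"']+ )/gcx) { }
       return $text =~ /\G>/gc ? pos($text) : undef;
   }

For the Netscape example above this skips past the quoted "ba>" and
stops at the real '>', which matches what Netscape appears to do.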

I also wonder how this all interacts with the earlier bug report that
led to the following change:

| Declaration parsing mode now only triggers for <!DOCTYPE ...> and
| <!ENTITY ...>.  Based on patch by la mouton <[EMAIL PROTECTED]>.
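
A quick way to check the interaction (again untested, written from
memory of the handler names) is to feed both a real declaration and the
junk form through the parser and see which handler fires:

   use HTML::Parser;

   my $p = HTML::Parser->new(
       api_version   => 3,
       declaration_h => [ sub { print "declaration: $_[0]\n" }, 'text' ],
       comment_h     => [ sub { print "comment:     $_[0]\n" }, 'text' ],
       text_h        => [ sub { print "text:        $_[0]\n" }, 'text' ],
   );

   # <!DOCTYPE ...> should still trigger declaration parsing; the question
   # is what the junk form ends up as once the FAIL: path is changed.
   $p->parse('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"> <!a href="x"}junk{/a>');
   $p->eof;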

Regards,
Gisle
