On Aug 22, 2010, at 15:25 , Gabor Szabo wrote:

> I am not sure where to post this HTML::Parser related question but
> this list seems to be close and Gisle is frequent here (and the Perl
> Monks did not give me a solution) so I hope you'll forgive me:

No reason to forgive :-)  This list is the recommended place to raise issues 
with the HTML-Parser dist as well.

> Using HTML::Parser it is unclear to me how I am supposed to notice when
> a tag whose end tag is missing has indeed ended. It seems that in
> some cases I get an explicit end event but in other cases I don't.

The general rule is that HTML::Parser gives you back a stream of tags that 
correspond exactly to the text of the parsed document.  If you want implicit 
tags to be intuited you have to use modules like those found in the HTML-Tree 
dist.
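
As a minimal sketch of the HTML-Tree route, assuming that dist is installed: 
HTML::TreeBuilder builds a tree from the same <ul> markup as in your third 
test case, and the second <li>, which has no end tag in the source, should 
still come out as a complete element.  The variable names are only for 
illustration.

```perl
use strict;
use warnings;

use HTML::TreeBuilder ();

# Same markup as the third test case: the second <li> has no end tag.
my $tree = HTML::TreeBuilder->new_from_content('<ul><li>abc</li><li>def</ul>');

# look_down() returns every matching element in document order;
# as_text() gives the text content of each one.
my @items = map { $_->as_text } $tree->look_down(_tag => 'li');
print scalar(@items), " li elements: @items\n";

# HTML::Element trees should be freed explicitly when no longer needed.
$tree->delete;
```

The tree builder decides where the implicit end tags belong, so there is no 
end event to wait for; you simply walk the finished elements.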

But there are seven tags that HTML::Parser treats specially: <script>, <style>, 
<xmp>, <iframe>, <plaintext>, <title> and <textarea>.  After one of these start 
tags has been seen, no other tag is recognized until the corresponding end tag 
is seen.
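
A small sketch of that rule, using <script> as the special tag: the <b> that 
appears inside the script body should not produce a start event, while the 
<p> after </script> should.  The sample markup and the @seen array are made 
up for the illustration.

```perl
use strict;
use warnings;

use HTML::Parser ();

my @seen;    # tag names reported by the start handler

my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => sub { push @seen, shift }, "tagname");

# The <b>...</b> inside the script body is treated as text, not as tags.
$p->parse('<script>document.write("<b>hi</b>");</script><p>after</p>');
$p->eof;

# Expected to list only the tags recognized outside the script body.
print "@seen\n";
```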

What you experienced with <title> is the heuristic HTML::Parser uses to 
recover when it does not find any "</title>" before it hits the end of the 
document.  In this case it inserts a fake </title> just before the first tag 
found after <title>.  The rules for recovery differ between the special tags.  
For <script> and <style> an empty element is preferred.  For <plaintext>, 
<xmp>, <iframe> and <textarea> all the rest of the document is considered text.

Hope this helps!

Regards,
Gisle


> 
> See the example code:
> 
> use strict;
> use warnings;
> 
> use HTML::Parser ();
> 
> sub event_handler {
>    my ($event, $elem) = @_;
>    print "$event $elem\n";
> }
> 
> 
> my $p = HTML::Parser->new(api_version => 3);
> $p->handler( start => \&event_handler, "event, tagname");
> $p->handler( end   => \&event_handler, "event, tagname");
> $p->parse('<head><title>abc</title></head>');
> $p->eof;
> print "----\n";
> $p->parse('<head><title>abc</head>');
> $p->eof;
> 
> print "----\n";
> $p->parse('<ul><li>abc</li><li>def</ul>');
> $p->eof;
> exit;
> 
> The result of which is
> 
> start head
> start title
> end title
> end head
> ----
> start head
> start title
> end title
> end head
> ----
> start ul
> start li
> end li
> start li
> end ul
> 
> That is, the missing </title> tag explicitly generated an end event
> while the missing </li> did not.
> 
> 
> regards
>  Gabor
