On Aug 22, 2010, at 15:25, Gabor Szabo wrote:

> I am not sure where to post this HTML::Parser related question but
> this list seems to be close and Gisle is frequent here (and the Perl
> Monks did not give me a solution) so I hope you'll forgive me:
No reason to forgive :-) This list is the recommended place to raise issues
with the HTML-Parser dist as well.

> Using HTML::Parser it is unclear to me how I am supposed to notice when
> a tag whose end tag is missing has indeed ended. It seems that in
> some cases I get an explicit end event but in other cases I don't.

The general rule is that HTML::Parser gives you back a stream of tags that
correspond exactly to the text of the parsed document. If you want implicit
tags to be intuited you have to use modules like those found in the HTML-Tree
dist.

But there are seven tags that HTML::Parser treats specially: <script>,
<style>, <xmp>, <iframe>, <plaintext>, <title> and <textarea>. After one of
these start tags has been seen, no other tag is recognized until the
corresponding end tag is seen.

What you experienced with <title> is the heuristic HTML::Parser uses to
recover when it does not find any "</title>" before it hits the end of the
document. In this case it inserts a fake </title> just before the first tag
found after <title>.

The rules for recovery differ between the special tags. For <script> and
<style> an empty element is preferred. For <plaintext>, <xmp>, <iframe> and
<textarea> all the rest of the document is considered text.

Hope this helps!
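In case it is useful, here is a minimal sketch of how missing </li> tags
could be intuited on top of the plain HTML::Parser event stream. The helper
name events_with_implicit_li and the "(implicit)" label are made up for this
illustration and are not part of HTML::Parser. It assumes flat, non-nested
lists; for real documents the HTML-Tree modules are the better choice.

```perl
use strict;
use warnings;

use HTML::Parser ();

# Hypothetical helper, not part of HTML::Parser: returns the start/end
# events for $html, adding an "end li (implicit)" event wherever an <li>
# is still open when the next <li> or the closing </ul>/</ol> arrives.
# Assumption: lists are not nested.
sub events_with_implicit_li {
    my ($html) = @_;
    my @events;
    my $open_li = 0;    # true while an <li> has not been closed

    # Emit the implicit end event if an <li> is still open.
    my $close_li = sub {
        push @events, "end li (implicit)" if $open_li;
        $open_li = 0;
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag) = @_;
                $close_li->() if $tag eq 'li';    # new <li> closes the old one
                push @events, "start $tag";
                $open_li = 1 if $tag eq 'li';
            },
            "tagname",
        ],
        end_h => [
            sub {
                my ($tag) = @_;
                if    ($tag eq 'li')                 { $open_li = 0 }
                elsif ($tag eq 'ul' || $tag eq 'ol') { $close_li->() }
                push @events, "end $tag";
            },
            "tagname",
        ],
    );
    $p->parse($html);
    $p->eof;
    return @events;
}

print "$_\n" for events_with_implicit_li('<ul><li>abc</li><li>def</ul>');
```

The parser itself still reports only the tags that are literally in the
text; the bookkeeping that decides when an <li> has ended lives entirely in
the handlers.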
Regards,
Gisle

> See the example code:
>
> use strict;
> use warnings;
>
> use HTML::Parser ();
>
> sub event_handler {
>     my ($event, $elem) = @_;
>     print "$event $elem\n";
> }
>
>
> my $p = HTML::Parser->new(api_version => 3);
> $p->handler( start => \&event_handler, "event, tagname");
> $p->handler( end => \&event_handler, "event, tagname");
> $p->parse('<head><title>abc</title></head>');
> $p->eof;
> print "----\n";
> $p->parse('<head><title>abc</head>');
> $p->eof;
>
> print "----\n";
> $p->parse('<ul><li>abc</li><li>def</ul>');
> $p->eof;
> exit;
>
> The result of which is
>
> start head
> start title
> end title
> end head
> ----
> start head
> start title
> end title
> end head
> ----
> start ul
> start li
> end li
> start li
> end ul
>
> That is, the missing </title> tag explicitly generated an end event
> while the missing </li> did not.
>
>
> regards
> Gabor