HTML::HeadParser: patch to &lt;title&gt; handling</span></a></span> </h1> <p class="darkgray font13"> <span class="sender pipe"><a href="/search?l=libwww@perl.org&q=from:%22Edward+Avis%22" rel="nofollow"><span itemprop="author" itemscope itemtype="http://schema.org/Person"><span itemprop="name">Edward Avis</span></span></a></span> <span class="date"><a href="/search?l=libwww@perl.org&q=date:20000718" rel="nofollow">Tue, 18 Jul 2000 06:02:43 -0700</a></span> </p> </div> <div itemprop="articleBody" class="msgBody"> <!--X-Body-of-Message--> <PRE>
This is a patch to HTML::HeadParser to let it cope better with some of
the badly written web pages out there.

Although a web page should have only one &lt;title&gt; element inside its
&lt;head&gt;, some web pages manage to have more than one.  An example is
&lt;<A HREF="http://asu.info.apple.com/swupdates.nsf/artnum/n10465">http://asu.info.apple.com/swupdates.nsf/artnum/n10465</A>&gt;,
which begins like this:

&lt;HTML&gt;
&lt;!-- Lotus-Domino (Release 4.6.2 (Intl) - 23 July 1998 on AIX) --&gt;
&lt;HEAD&gt;
&lt;TITLE&gt;Apple - Software Updates for LaserWriter Software 8.5.1&lt;/TITLE&gt;
[lots of &lt;META&gt; stuff snipped]
&lt;TITLE&gt;&lt;/TITLE&gt;
&lt;/HEAD&gt;

In a web browser, the first title is displayed.  But with
HTML::HeadParser, only the last title counts - so you get back an
empty string.

Now of course you could argue that this is reasonable.  If people
write invalid HTML, that's their problem.  But the de facto standard
is what the browser displays, and both Netscape 4.72 and MSIE 5.00
display the first title.

A general solution would be to aggregate all the &lt;title&gt; elements
together into a single title.  Here is a patch:

*** HeadParser.pm.orig  Thu Dec  9 19:07:33 1999
--- HeadParser.pm       Tue Jul 18 12:45:30 2000
***************
*** 135,141 ****
      $text =~ s/\s+/ /g;
      print "FLUSH $tag => '$text'\n" if $DEBUG;
      if ($tag eq 'title') {
!         $self->{'header'}->header(Title => $text);
      }
      $self->{'tag'} = $self->{'text'} = '';
  }
--- 135,156 ----
      $text =~ s/\s+/ /g;
      print "FLUSH $tag => '$text'\n" if $DEBUG;
      if ($tag eq 'title') {
!         my $old_title = $self->{'header'}->header('Title');
!         my $new_title;
!         if (defined $old_title and $old_title !~ /^\s*$/) {
!             # Some badly written pages have more than one title, but
!             # some titles may be empty.  Attempt to sort things out.
!             if ($text !~ /^\s*$/) {
!                 $new_title = "$old_title // $text";
!             }
!             else {
!                 $new_title = $old_title;
!             }
!         }
!         else {
!             $new_title = $text;
!         }
!         $self->{'header'}->header(Title => $new_title);
      }
      $self->{'tag'} = $self->{'text'} = '';
  }

This handles well written web pages with only one title, badly written
ones like the Apple/Domino one above, and (I hope) even worse ones.

I don't subscribe to this list (I just stumbled across the problem by
accident), so please cc: replies to me.

-- 
Ed Avis
[EMAIL PROTECTED]
</PRE> </div> <div class="msgButtons margintopdouble"> <ul class="overflow"> <li class="msgButtonItems"><a class="button buttonleft " accesskey="p" href="msg01015.html">Previous message</a></li> <li class="msgButtonItems textaligncenter"><a class="button" accesskey="c" href="index.html#01017">View by thread</a></li> <li class="msgButtonItems textaligncenter"><a class="button" accesskey="i" href="maillist.html#01017">View by date</a></li> <li class="msgButtonItems textalignright"><a class="button buttonright " accesskey="n" href="msg01018.html">Next message</a></li> </ul> </div> <a name="tslice"></a> <div class="tSliceList margintopdouble"> <ul class="icons monospace"> <li class="icons-email"><span class="subject"><a href="msg01018.html">Re: HTML::HeadParser: patch to &lt;title&gt; handling</a></span> <span class="sender italic">Edward Avis</span></li> <li><ul> <li class="icons-email"><span class="subject"><a href="msg01018.html">Re: HTML::HeadParser: patch to &lt;title&gt; handling</a></span> <span class="sender italic">gisle</span></li> <li><ul> 
<li class="icons-email"><span class="subject"><a href="msg01019.html">Re: HTML::HeadParser: patch to &lt;title&gt; handling</a></span> <span class="sender italic">Edward Avis</span></li> </ul> </ul> </ul> </div> </div> </div> <div class="footer" role="contentinfo"> <ul> <li class="darkgray">Pine.LNX.4.21.0007181352070.7032-100000@pixel13.doc.ic.ac.uk</li> </ul> </div> </body> </html>