On Sun, Aug 23, 2009 at 02:56:44PM +0400, Roman Makurin wrote:
> Hi All!
> 
> How can I tell HTML::TreeBuilder to parse invalid HTML files
> gracefully? Here is an example:
> 
> -----
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> use HTML::TreeBuilder;
> 
> my $root = HTML::TreeBuilder->new_from_file(*DATA);
> 
> print +($root->look_down(_tag=>'div', class=>'text'))->as_text, $/;
> 
> 
> __DATA__
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
> </head>
> <body>
> <div class="body">
>   <div class="doc">
>     <p>some text
>     <div class="text">
>       <head>
>         <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
>       </head>
>       <p> some other text      
>     </div>
>   </div>
> </div>
> </body>
> </html>
> --------
> 
> For some reason someone put a head tag inside a div :)
> All browsers handle this case correctly, but HTML::TreeBuilder
> returns an undefined text value when I call the as_text method on
> <div class="text">. Without the inner head section everything works
> as expected.
> 
> Is there any way to tell HTML::TreeBuilder to handle
> such situations?
> 
> 
> Thanks.
> 
> -- 
> If you think of MS-DOS as mono, and Windows as stereo,
>  then Linux is Dolby Digital and all the music is free...

Just found a solution: set the parser to ignore some tags:

$root = HTML::TreeBuilder->new;
# skip the start/end tags of these elements while parsing
$root->ignore_tags(qw/head meta link style/);
$root->parse_file(*FH);


Don't know if it's the best way, but it works :)
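
For anyone who wants to try it, here is a minimal sketch of the original test case with that change applied. It assumes ignore_tags (inherited from HTML::Parser) behaves with HTML::TreeBuilder as described above; text_of_div is just a hypothetical helper name for this example, and the markup is fed in through parse/eof from a string instead of a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Same markup as the test case, with the stray <head> inside the div.
my $html = <<'HTML';
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
</head>
<body>
<div class="body">
  <div class="doc">
    <p>some text
    <div class="text">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
      </head>
      <p> some other text
    </div>
  </div>
</div>
</body>
</html>
HTML

# text_of_div() is a hypothetical helper, not part of HTML::TreeBuilder.
sub text_of_div {
    my ($markup, $class) = @_;

    my $root = HTML::TreeBuilder->new;
    # Start and end tags with these names are skipped by the parser,
    # so the misplaced <head> never reaches the tree-building code.
    $root->ignore_tags(qw/head meta link style/);
    $root->parse($markup);
    $root->eof;

    my $div  = $root->look_down(_tag => 'div', class => $class);
    my $text = defined($div) ? $div->as_text : undef;

    $root->delete;    # free the tree explicitly
    return $text;
}

my $text = text_of_div($html, 'text');
print defined($text) ? $text : '(no matching div)', "\n";
```

Note that the real <head> at the top of the document is ignored too, so this only suits cases where nothing from the head section is needed.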
