On Sun, Aug 23, 2009 at 02:56:44PM +0400, Roman Makurin wrote:
> Hi All!
>
> How can I tell HTML::TreeBuilder to parse invalid html files
> gracefully? Here is an example:
>
> -----
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> use HTML::TreeBuilder;
>
> my $root = HTML::TreeBuilder->new_from_file(*DATA);
>
> print +($root->look_down(_tag=>'div', class=>'text'))->as_text, $/;
>
>
> __DATA__
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
> </head>
> <body>
> <div class="body">
> <div class="doc">
> <p>some text
> <div class="text">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
> </head>
> <p> some other text
> </div>
> </div>
> </div>
> </body>
> </html>
> --------
>
> for some reason someone put head tag inside of div :)
> all browsers handle such case correctly, but HTML::TreeBuilder
> returns undefined text value if I use as_text method on
> <div class="text">. Without inner head section all works
> as expected.
>
> Is there any way to tell HTML::TreeBuilder to handle
> such situations?
>
>
> Thanks.
>
> --
> If you think of MS-DOS as mono, and Windows as stereo,
> then Linux is Dolby Digital and all the music is free...

Just found a solution: set the parser to ignore some tags before parsing:

    $root = HTML::TreeBuilder->new;
    $root->ignore_tags(qw/head meta link style/);
    $root->parse_file(*FH);

I don't know if it's the best way, but it works :)

-- 
If you think of MS-DOS as mono, and Windows as stereo,
then Linux is Dolby Digital and all the music is free...
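
For anyone who wants to try the workaround above on the sample document, here is a minimal self-contained sketch. It assumes HTML::TreeBuilder is installed and feeds the HTML from a string instead of a filehandle; the variable names $html, $div and $text are only for illustration. The object has to be created with new() first, because new_from_file() parses immediately and leaves no chance to call ignore_tags() beforehand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample input from the original report: a stray <head> inside a <div>.
my $html = <<'HTML';
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
</head>
<body>
<div class="body">
<div class="doc">
<p>some text
<div class="text">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
</head>
<p> some other text
</div>
</div>
</div>
</body>
</html>
HTML

# Create the tree object without parsing anything yet.
my $root = HTML::TreeBuilder->new;

# Suppress start/end events for these tags (method inherited from
# HTML::Parser). This must happen before any input is parsed.
$root->ignore_tags(qw(head meta link style));

$root->parse($html);
$root->eof;    # signal end of input so the tree is finalized

# look_down in scalar context returns the first match, or undef.
my $div  = $root->look_down(_tag => 'div', class => 'text');
my $text = defined $div ? $div->as_text : undef;

print defined $text ? "$text\n" : "no matching div found\n";

$root->delete;    # release the tree explicitly
```

Note that ignore_tags() only drops the tags themselves, so any text between them (for example the CSS inside a style element) still reaches the tree. HTML::Parser also documents ignore_elements(), which suppresses the content as well; I have not tested whether that is the better choice here.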