On May 13, 2009, Toby Inkster wrote:
> On Tue, 2009-05-12 at 16:00 -0600, Gordon Haverland wrote:
> > HTML Tidy is freely available for many platforms, and is fast
> > (I think it is written in C). If I run Tidy to indent and
> > clean, and to assume UTF8 input and generate UTF8 output, I
> > almost get a file which XML::Twig will process. There are a
> > couple of attributes of elements in the page which are empty
> > (such as alt=""), which XML::Twig thinks are duplicates. And
> > XML::Twig doesn't understand &nbsp; for some reason.
> > Deleting the empty attributes from the text of the page, and
> > changing &nbsp; into a space are enough to get XML::Twig to
> > parse the file.
>
> XML::Twig doesn't understand &nbsp; because by default it
> ignores DOCTYPEs, and in XML, that's where the entity names
> (like 'nbsp') are defined! It's possible to remedy this using
> the options 'load_DTD' and 'expand_external_ents'.
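
The entity problem described above is a general XML rule, so it can be reproduced outside Perl. What follows is only a minimal sketch in Python, using the standard library's xml.etree.ElementTree rather than XML::Twig, so it illustrates the rule and not Twig's own behaviour. The sample markup and the helper name parse_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: well-formed markup that uses the HTML entity &nbsp;
# but has no DOCTYPE, so an XML parser has no definition for the name.
SAMPLE = '<p>left&nbsp;right</p>'


def parse_text(markup):
    """Parse markup as XML; return the root element's text, or None on a parse error."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


# No DTD: the named entity is undefined, so parsing fails.
undefined = parse_text(SAMPLE)

# Workaround from the thread: replace the entity with a plain space.
with_space = parse_text(SAMPLE.replace('&nbsp;', ' '))

# Alternative: a numeric character reference needs no DTD at all.
with_numeric = parse_text(SAMPLE.replace('&nbsp;', '&#160;'))

# Declaring the entity in an internal DOCTYPE subset also makes it parse,
# which is the point about DOCTYPEs being where entity names are defined.
DECLARED = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>' + SAMPLE
with_doctype = parse_text(DECLARED)
```

Replacing the entity with a space changes the text slightly (a normal space instead of a non-breaking one); the numeric reference and the DOCTYPE declaration both preserve the original character.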

I think you've used Twig more than I have. :-)

> Generally speaking, the best way of parsing tag soup HTML in
> Perl is to use HTML::TreeBuilder.

A long time ago, I was trying to parse some HTML junk and none of
the Perl parsers could deal with it. I ended up getting Tidy to
clean things up, then dealing with it in Perl. It's possible that
HTML::Parser is better these days.

Hopefully I'll have the Twig/XPath stuff working soon.

Gord
_______________________________________________
Libre-fm mailing list
http://lists.autonomo.us/mailman/listinfo/libre-fm
