On Tue, 2009-05-12 at 16:00 -0600, Gordon Haverland wrote:
> HTML Tidy is freely available for many platforms, and is fast (I
> think it is written in C). If I run Tidy to increment and clean,
> and to assume UTF8 input and generate UTF8 output. I almost get a
> file which XML::Twig will process. There are a couple of
> attributes of elements in the page which are empty (such as
> alt=""), which XML::Twig thinks are duplicates. And XML::Twig
> doesn't understand &nbsp; for some reason. Deleting the empty
> attributes from the text of the page, and changing &nbsp; into a
> space are enough to get XML::Twig to parse the file.
XML::Twig doesn't understand &nbsp; because by default it ignores DOCTYPEs, and in XML, that's where the entity names (like 'nbsp') are defined! It's possible to remedy this using the options 'load_DTD' and 'expand_external_ents'.

Generally speaking, the best way of parsing tag soup HTML in Perl is to use HTML::TreeBuilder. If you then want it in a proper DOM tree (which is useful if you're familiar with handling HTML in JavaScript, as it enables you to use familiar methods like getElementsByTagName), then use HTML::Element's as_XML method to dump out the tree as an XML string and then slurp that up with XML::LibXML::Parser's parse_html_string method. It's rare to find a page which is so broken that this method fails.

Really though, someone needs to implement the HTML5 parsing algorithm in Perl.

-- 
Toby Inkster <[email protected]>

_______________________________________________
Libre-fm mailing list
[email protected]
http://lists.autonomo.us/mailman/listinfo/libre-fm
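
For what it's worth, the HTML::TreeBuilder route described above might look roughly like the following. This is only a sketch: the $soup string is a made-up sample page, and the 'recover' parser option is my own assumption rather than something the method strictly needs.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::LibXML;

# Hypothetical tag-soup input: an unclosed <p>, an empty alt attribute,
# an &nbsp; entity, and no closing </html>.
my $soup = '<html><body><p>One&nbsp;two<p><img src="a.png" alt="">three</body>';

# Step 1: let HTML::TreeBuilder repair the markup into a tree.
my $tree = HTML::TreeBuilder->new_from_content($soup);

# Step 2: dump the repaired tree as an XML string, then free the tree.
my $xml = $tree->as_XML;
$tree->delete;

# Step 3: slurp the string into XML::LibXML to get a DOM document.
# 'recover' (an assumption here) asks libxml2 to carry on past minor errors.
my $parser = XML::LibXML->new(recover => 1);
my $dom    = $parser->parse_html_string($xml);

# Step 4: use the familiar DOM methods.
my @paras = $dom->getElementsByTagName('p');
printf "%d paragraph(s)\n", scalar @paras;
print $_->textContent, "\n" for @paras;
```

The point of the round trip is that HTML::TreeBuilder does the forgiving parse, while XML::LibXML supplies the DOM interface.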
