Re: [Libre-fm] Dinosaurs migrating from last.fm?

Toby Inkster Wed, 13 May 2009 03:33:14 -0700

On Tue, 2009-05-12 at 16:00 -0600, Gordon Haverland wrote:
> HTML Tidy is freely available for many platforms, and is fast (I 
> think it is written in C).  If I run Tidy to indent and clean, 
> and to assume UTF-8 input and generate UTF-8 output, I almost get a 
> file which XML::Twig will process.  There are a couple of 
> attributes of elements in the page which are empty (such as 
> alt=""), which XML::Twig thinks are duplicates.  And XML::Twig 
> doesn't understand &nbsp; for some reason.  Deleting the empty 
> attributes from the text of the page, and changing &nbsp; into a 
> space are enough to get XML::Twig to parse the file.


XML::Twig doesn't understand &nbsp; because by default it ignores
DOCTYPEs, and in XML, that's where the entity names (like 'nbsp') are
defined! It's possible to remedy this using the options 'load_DTD' and
'expand_external_ents'.
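
As a rough sketch of how those two options might be passed to the
constructor (the make_twig helper and the command-line file argument are
made up for illustration; whether &nbsp; then resolves depends on the
DTD named in the page's DOCTYPE actually being readable from where the
script runs):

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: build a twig that loads the DTD referenced by the
# DOCTYPE instead of ignoring it, and expands external entities on output.
sub make_twig {
    return XML::Twig->new(
        load_DTD             => 1,
        expand_external_ents => 1,
    );
}

# Illustrative usage: parse the file named on the command line, if any,
# and print the resulting document.
if (@ARGV) {
    my $twig = make_twig();
    $twig->parsefile($ARGV[0]);
    $twig->print;
}
```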

Generally speaking, the best way of parsing tag soup HTML in Perl is to
use HTML::TreeBuilder. 

If you then want it in a proper DOM tree (which is useful if you're
familiar with handling HTML in Javascript, as it enables you to use
familiar methods like getElementsByTagName), then use HTML::Element's
as_XML method to dump out the tree as an XML string and then slurp that
up with XML::LibXML::Parser's parse_html_string method.

It's rare to find a page which is so broken that this method fails.
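
A minimal sketch of that pipeline, assuming HTML::TreeBuilder and
XML::LibXML are both installed (the html_to_dom helper name and the
sample markup are invented for illustration):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::LibXML;

# Hypothetical helper: tag soup in, XML::LibXML document out.
sub html_to_dom {
    my ($html) = @_;

    # Let HTML::TreeBuilder make sense of the broken markup.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Dump the repaired tree as an XML string, then free the tree.
    my $xml = $tree->as_XML;
    $tree->delete;

    # Slurp the string back up into a DOM.
    return XML::LibXML->new->parse_html_string($xml);
}

# Sample input with unclosed <b> and <a> elements.
my $soup = '<p>Unclosed <b>bold <a href="http://example.org/">a link</p>';
my $dom  = html_to_dom($soup);

# DOM-style access, much as in Javascript.
for my $link ($dom->getElementsByTagName('a')) {
    print $link->getAttribute('href'), "\n";
}
```

The tree is deleted explicitly because HTML::Element trees have
traditionally contained circular references that Perl's garbage
collector will not clean up on its own.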

Really though, someone needs to implement the HTML5 parsing algorithm in
Perl.

-- 
Toby Inkster <[email protected]>