Re: [Libre-fm] Dinosaurs migrating from last.fm?

Gordon Haverland Wed, 13 May 2009 14:59:20 -0700

On May 13, 2009, Toby Inkster wrote:
> On Tue, 2009-05-12 at 16:00 -0600, Gordon Haverland wrote:
> > HTML Tidy is freely available for many platforms, and is fast
> > (I think it is written in C).  If I run Tidy to indent and
> > clean, and to assume UTF8 input and generate UTF8 output, I
> > almost get a file which XML::Twig will process.  There are a
> > couple of attributes of elements in the page which are empty
> > (such as alt=""), which XML::Twig thinks are duplicates.  And
> > XML::Twig doesn't understand &nbsp; for some reason. 
> > Deleting the empty attributes from the text of the page, and
> > changing &nbsp; into a space are enough to get XML::Twig to
> > parse the file.
>
> XML::Twig doesn't understand &nbsp; because by default it
> ignores DOCTYPEs, and in XML, that's where the entity names
> (like 'nbsp') are defined! It's possible to remedy this using
> the options 'load_DTD' and 'expand_external_ents'.


I think you've used Twig more than I have.  :-)
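
A minimal sketch of the textual cleanup described in the quoted message
(drop empty attributes, turn &nbsp; into a space) before handing Tidy's
output to XML::Twig.  The helper name clean_for_twig and the sample page
are hypothetical, and the regexes are a rough approximation that would
also touch matching text outside of tags.  The 'load_DTD' and
'expand_external_ents' options are the ones named above; they are shown
only in a comment because they need the DTD to be reachable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply the two textual
# fixes to Tidy's output so that XML::Twig will accept it.
sub clean_for_twig {
    my ($xhtml) = @_;
    # Drop attributes whose value is empty, e.g. alt="".
    $xhtml =~ s/\s+[\w:-]+\s*=\s*""//g;
    # Replace the named entity with a plain space.
    $xhtml =~ s/&nbsp;/ /g;
    return $xhtml;
}

# Made-up sample standing in for a page already run through Tidy.
my $page  = '<p><img src="a.png" alt=""/>one&nbsp;two</p>';
my $clean = clean_for_twig($page);
print "$clean\n";    # prints <p><img src="a.png"/>one two</p>

# Parse only if XML::Twig is installed.
if (eval { require XML::Twig; 1 }) {
    # Alternative suggested in the thread, instead of the textual fix
    # for &nbsp; (assumes the DOCTYPE's DTD can be fetched):
    #   XML::Twig->new(load_DTD => 1, expand_external_ents => 1);
    my $twig = XML::Twig->new;
    $twig->parse($clean);
    print $twig->root->text, "\n";
}
```

Replacing &nbsp; with a space loses the non-breaking property; if that
matters, substituting the numeric reference &#160; keeps it and needs no
DTD.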

> Generally speaking, the best way of parsing tag soup HTML in
> Perl is to use HTML::TreeBuilder.

A long time ago, I was trying to parse some HTML junk and none of
the Perl parsers could deal with it.  I ended up getting Tidy to
clean things up, then dealing with the result in Perl.  It's
possible that HTML::Parser is better these days.
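
For comparison, a rough sketch of the HTML::TreeBuilder route suggested
in the quoted text, with no Tidy pass first.  The helper name
extract_links and the sample markup are hypothetical; the module is
loaded at run time so the script still runs where it is not installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect href values from possibly malformed HTML.
sub extract_links {
    my ($html) = @_;
    require HTML::TreeBuilder;    # dies here if the module is missing
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @hrefs = grep { defined }
                map  { $_->attr('href') }
                $tree->look_down(_tag => 'a');
    $tree->delete;                # free the tree explicitly
    return @hrefs;
}

# Made-up tag soup: unclosed <a> and <p>, plus a named entity.
my $soup = '<p>Unclosed <a href="http://example.org/">link<p>more&nbsp;text';

if (eval { require HTML::TreeBuilder; 1 }) {
    print "$_\n" for extract_links($soup);
}
else {
    print "HTML::TreeBuilder is not installed\n";
}
```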

Hopefully I'll have the Twig/XPath stuff working soon.

Gord
_______________________________________________
Libre-fm mailing list
http://lists.autonomo.us/mailman/listinfo/libre-fm