Rob Manson wrote:
> Here's a patch to prove that this is the problem using a quick and dirty
> regex fix:
>
> 848d847
> < $html =~ s/\ \;//igm;
>
> I tried it on both a simple hcard like
> http://microformats.org/wiki/User:RobManson and the full hcard page
> (which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
> the patch fixes it.

Thanks for your hint.

The XML::Parser module is able to fetch DTDs and use them, so it should be able to handle expansion of named entities by itself -- the only problem was that I had disabled it, partly to cut down on bandwidth usage, but also because I thought validating pages would break too many of them. Anyway, I've re-enabled it and this seems to have fixed more pages than it's broken. I'm guessing that XML::Parser does not validate based on the DTD -- it just uses it to expand entities.

With regard to speed, the slowness is because I'm using LWP::RobotUA instead of LWP::UserAgent. This downloads the robots.txt (and honours it) and also enforces a delay between requests. The delay is 1 minute by default, though I set it to 10 seconds -- or at least I thought I did, but I was trying to set it in the LWP::RobotUA constructor, which it seems does not work. The delay is now set to 5 seconds and works. This has made it significantly faster.

New version (0.1-alpha2.1):

  Online:   http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.pl
  Download: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.txt

This successfully parses both the pages you mentioned above.

Thanks again,

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 16:20.]

Bottled Water
http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/

_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
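
For anyone hitting the same LWP::RobotUA delay problem described in the message above, here is a minimal sketch of setting the delay through the delay() accessor after construction rather than in the constructor. It is not taken from the cognition script: the agent name and contact address are made-up placeholders. Note that delay() is expressed in minutes, so 5 seconds is written as 5/60.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical agent name and contact address (assumptions for this sketch);
# LWP::RobotUA insists on both, and the address must contain an "@".
my $ua = LWP::RobotUA->new('cognition-example/0.1', 'webmaster@example.org');

# delay() takes its argument in MINUTES, not seconds.
# The default is 1 minute; 5 seconds is therefore 5/60.
$ua->delay(5 / 60);

# Read the value back and convert it to seconds for display.
printf "Delay between requests: %.0f seconds\n", $ua->delay * 60;
```

Nothing here touches the network; robots.txt is only fetched, and the delay only enforced, once a request is actually made with $ua.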