On Monday 16 March 2009 00:35:47 Stuart Grimshaw wrote:
> I'm trying to parse a web page for the local council using lxml and
> when it gets as far as parsing the html the program just hangs, this
> test script is below, and, well I'm stumped. It even freezes if I
> stick the BBCs site in.
>
> What am I missing?
I'd guess it's a web proxy issue.
The way to check is to look at the output of "netstat -nat" and see whether
you have any connections stuck in the SYN_SENT state. If you're on Linux and
running as the right user, you can also run "netstat -natp", which shows the
owning process, to confirm the connection belongs to your script.
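
If you'd rather not eyeball the netstat output, a rough sketch of the same
check in Python might look like this. The function name and the sample text
are made up for illustration (the sample is not real output from any machine);
in practice you would feed it whatever "netstat -nat" prints.

```python
def syn_sent_lines(netstat_output):
    """Return the lines of `netstat -nat` output whose state is SYN_SENT."""
    return [line for line in netstat_output.splitlines()
            if "SYN_SENT" in line.split()]


# Illustrative sample only, not captured from a real system.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      1 192.168.1.10:45678      203.0.113.5:80          SYN_SENT
tcp        0      0 192.168.1.10:22         192.168.1.20:51234      ESTABLISHED
"""

# Connections listed here were opened but never answered by the remote end.
stuck = syn_sent_lines(sample)
for line in stuck:
    print(line)
```

Anything this returns is a connection that was started but never answered,
which is what you'd expect to see if a proxy or firewall is in the way.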
Try calling r.read() on its own before the html.parse line to confirm whether
it's the fetch that hangs :)
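
A minimal sketch of that check, assuming Python 3's urllib.request rather
than whatever your script uses to open the URL. The function name and the
10 second timeout are arbitrary choices for the example. With a timeout set,
a blocked connection shows up as an error instead of a silent hang.

```python
import urllib.request


def read_first(url, timeout=10):
    """Fetch url and return the raw body, or None if the fetch fails."""
    try:
        # The timeout turns an unanswered connection into an exception.
        with urllib.request.urlopen(url, timeout=timeout) as r:
            return r.read()
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        print("fetch failed:", exc)
        return None
```

If this returns the page body promptly, the network side is fine and the
problem is in the parsing step; if it reports a timeout, look at the proxy.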
Beyond that, your code fails like this here:
Traceback (most recent call last):
  File "pynw.py", line 23, in <module>
    main()
  File "pynw.py", line 20, in main
    self.dom = html.parse(r.read())
  File "/usr/local/lib/python2.5/site-packages/lxml-2.1.5-py2.5-linux-i686.egg/lxml/html/__init__.py", line 651, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2583, in lxml.etree.parse (src/lxml/lxml.etree.c:25057)
  File "parser.pxi", line 1465, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:63523)
  File "parser.pxi", line 1494, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:63767)
  File "parser.pxi", line 1394, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:62923)
  File "parser.pxi", line 968, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60309)
  File "parser.pxi", line 542, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56659)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57504)
  File "parser.pxi", line 566, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56876)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- // EasySite CMS v5 // -->
<!-- // EIBS Ltd - Simplifying eContent // -->
<!-- // Content Management, Document Managment, eCommerce and Online Publishing // -->
<!-- // Unit 3, Wilford Business Park, Ruddington Lane, Nottingham, NG11 7EP, United Kingdom // -->
<!-- // http://www.eibs.co.uk/ - [email protected] - +44 (0) 8700 129 029 // -->
<!-- // Copyright 1999 - 2009 EIBS Ltd // -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
... followed by body of page.
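
Reading that traceback, it looks like html.parse() is treating the page text
itself as a file name: lxml.html.parse() expects a filename, URL or file-like
object, whereas lxml.html.fromstring() takes the markup directly. A small
sketch of both routes follows; the page bytes are a made-up stand-in for
whatever r.read() returns, not the council page.

```python
from io import BytesIO

from lxml import html

# Hypothetical stand-in for the bytes returned by r.read().
page = (b"<html><head><title>Council</title></head>"
        b"<body><p>Hello</p></body></html>")

# html.parse() wants a filename, URL or file-like object, so wrap the bytes.
tree = html.parse(BytesIO(page))
title_from_parse = tree.getroot().findtext(".//title")

# html.fromstring() accepts the markup itself and returns the root element.
root = html.fromstring(page)
title_from_string = root.findtext(".//title")
```

Passing the open response object straight to html.parse(), without calling
read() on it first, should also work since it is file-like.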
Regards,
Michael.
--
http://yeoldeclue.com/blog
http://twitter.com/kamaelian
http://www.kamaelia.org/Home