My computer has two operating systems.

1. XP + Python 3.2

import lxml.html
sfile='http://finance.yahoo.com/q/op?s=A+Options'
root=lxml.html.parse(sfile).getroot()

This works fine.

import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()

This one fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'

2. Ubuntu 11.04 + Python 2.6

import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()

This works fine.

This difference is very strange to me; I don't understand it.
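
One thing I could try to narrow this down (this is only a guess, not something the traceback proves): lxml's "failed to load external entity" error usually means libxml2 could not fetch the URL at all, so the difference may be in the network/HTTP layer on the XP machine rather than in the parser. A minimal check, assuming Python 3.2 on the XP box (the User-Agent header is only a hypothesis):

# Fetch the page myself, bypassing libxml2's own URL handling.
import urllib.request
import lxml.html

url = 'http://frux.wikispaces.com/'
# Some servers reject requests without a browser-like User-Agent;
# this header is only a guess, not a confirmed fix.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = urllib.request.urlopen(req).read()
print(len(data))                     # did the download succeed at all?

root = lxml.html.fromstring(data)    # parse the bytes fetched above
print(root.tag)

If the download itself fails, the machine cannot reach the site; if it succeeds, the problem is in how libxml2 fetches URLs on that setup.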
------------------ Original Message ------------------
From: "Dave Angel" <d...@davea.name>
Sent: Monday, October 24, 2011, 9:22 AM
To: "1248283536" <1248283...@qq.com>
Cc: "lxml" <l...@lxml.de>; "python-list" <python-list@python.org>
Subject: Re: getroot() problem

 
 On 10/23/2011 09:06 PM, 水静流深 wrote:
> C:\Documents and Settings\peng>cd c:\python32
>
> C:\Python32>python
> Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import lxml.html
>>>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>>>> root=lxml.html.parse(sfile).getroot()
>
> There is no problem parsing http://finance.yahoo.com/q/op?s=A+Options
>
> Why can I not parse http://frux.wikispaces.com/ ?
>
>>>> import lxml.html
>>>> sfile='http://frux.wikispaces.com/'
>>>> root=lxml.html.parse(sfile).getroot()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
>     return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
>   File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
>   File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
>   File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
>   File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
>   File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
>   File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
>   File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
>   File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
> IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'
Double-spacing makes your message much harder to read. In any case, I can only comment in a general way. Most HTML is malformed and not legal HTML. Although I don't have any experience with parsing HTML, I do with XML, which has similar problems.

The first thing I'd do is separate the loading of the byte string from the website from the parsing of those bytes. Further, I'd make a local copy of those bytes so you can test repeatably. For example, you could run the wget utility to fetch the bytes and save them to a local file.
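
Something along these lines (a rough sketch of what I mean, shown for Python 3 since your traceback comes from Python 3.2; the local filename is just an example):

import urllib.request
import lxml.html

url = 'http://frux.wikispaces.com/'
local_copy = 'frux.html'          # example name for the local copy

# Step 1: load the byte string from the website and save it locally.
data = urllib.request.urlopen(url).read()
with open(local_copy, 'wb') as f:
    f.write(data)

# Step 2: parse the local copy, so the parsing step can be re-tested
# repeatedly without touching the network.
root = lxml.html.parse(local_copy).getroot()
print(root.tag)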
-- 

DaveA