Bugs item #1651995, was opened at 2007-02-04 23:34
Message generated for change (Comment added) made by wrstlprmpft
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2.
Initial Comment:
I'm running a website page through BeautifulSoup. It parses OK with Python
2.4, but Python 2.5 fails with an exception:
Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
    self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
  File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "./sitetruth/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
The code that's failing is in "_convert_ref", which is new in Python 2.5.
That function wasn't present in 2.4. I think the code is trying to handle
single quotes inside of double quotes in HTML attributes, or something like
that.
To replicate, run
http://www.bankofamerica.com
or
http://www.gm.com
through BeautifulSoup.
Something about this code doesn't like big companies. Web sites of smaller
companies are going through OK.
----------------------------------------------------------------------
Comment By: wrstl prmpft (wrstlprmpft)
Date: 2007-02-05 08:16
Message:
Logged In: YES
user_id=801589
Originator: NO
I had a similar problem recently and did not have time to file a
bug-report. Thanks for doing that.
The problem is the code that handles entity and character references in
SGMLParser.parse_starttag. It does not seem to be careful about mixing
unicode and str.
(But maybe BeautifulSoup needs to tell it to?)
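As a rough illustration of the kind of unicode/str mix-up suspected here
(an assumption; the thread does not confirm the exact mechanism): when
Python 2 combines a byte string with a unicode string, it implicitly decodes
the byte string as ASCII. The sketch below makes that decode explicit for the
byte 0xa7 from the traceback. It is written to run on a current Python, not
on 2.5, and it does not call sgmllib at all.

```python
# Assumption: a single non-ASCII byte stands in for whatever value the
# parser substituted for the reference; 0xa7 is taken from the traceback.
raw = b"\xa7"

message = None
try:
    # Python 2 performs this ASCII decode implicitly whenever a byte
    # string is combined with a unicode string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

Any byte above 0x7f fails the same way, which would fit pages that use
non-ASCII character references in attribute values.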
My quick'n'dirty workaround was to remove the offending char-entity from
the website before feeding it to BeautifulSoup::
text = text.replace('&reg;', '') # remove rights reserved sign entity
cheers,
stefan
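A slightly more general form of the workaround above, as a sketch only. The
helper name strip_high_refs and the choice to drop every reference that
resolves to a code point in 128..255 are assumptions, picked because the
traceback complains about byte 0xa7; the thread does not say exactly which
references trigger the error. Hexadecimal references such as &#xA7; are left
untouched.

```python
import re

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# Matches "&#167;" (group 1 = digits) or "&reg;" (group 2 = name).
_REF = re.compile(r'&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9]*));')


def strip_high_refs(text):
    """Hypothetical helper: drop references resolving to 128..255."""
    def repl(match):
        digits, name = match.groups()
        if digits is not None:
            codepoint = int(digits)
        else:
            # Unknown entity names map to -1 and are therefore kept.
            codepoint = name2codepoint.get(name, -1)
        if 128 <= codepoint <= 255:
            return ''
        return match.group(0)
    return _REF.sub(repl, text)


cleaned = strip_high_refs('<a title="&reg; &#167; &amp; &#65;">x</a>')
print(cleaned)
```

The cleaned text would then be passed to BeautifulSoup in place of the
original markup. Like the one-line replace, this loses the characters
instead of fixing the parser, so it is a stopgap at best.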
----------------------------------------------------------------------