Bugs item #1651995, was opened at 2007-02-04 22:34 Message generated for change (Comment added) made by nagle You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Python Library Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Nagle (nagle) Assigned to: Nobody/Anonymous (nobody) Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2. Initial Comment: I'm running a website page through BeautifulSoup. It parses OK with Python 2.4, but Python 2.5 fails with an exception: Traceback (most recent call last): File "./sitetruth/InfoSitePage.py", line 268, in httpfetch self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form File "./sitetruth/BeautifulSoup.py", line 1326, in __init__ BeautifulStoneSoup.__init__(self, *args, **kwargs) File "./sitetruth/BeautifulSoup.py", line 973, in __init__ self._feed() File "./sitetruth/BeautifulSoup.py", line 998, in _feed SGMLParser.feed(self, markup or "") File "/usr/lib/python2.5/sgmllib.py", line 99, in feed self.goahead(0) File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag self.finish_starttag(tag, attrs) File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag self.handle_starttag(tag, method, attrs) File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag method(attrs) File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta self._feed(self.declaredHTMLEncoding) File "./sitetruth/BeautifulSoup.py", line 998, in _feed SGMLParser.feed(self, markup or "") File "/usr/lib/python2.5/sgmllib.py", line 99, in feed self.goahead(0) File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag self._convert_ref, attrvalue) UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128) The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4. I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that. To replicate, run http://www.bankofamerica.com or http://www.gm.com through BeautifulSoup. Something about this code doesn't like big companies. Web sites of smaller companies are going through OK. ---------------------------------------------------------------------- >Comment By: John Nagle (nagle) Date: 2007-02-07 07:57 Message: Logged In: YES user_id=5571 Originator: YES Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the code for handling character escapes assumes that ASCII characters have values up to 255. But the correct limit is 127, of course. If a Unicode string is run through SGMLparser, and that string has a character in an attribute with a value between 128 and 255, which is valid in Unicode, the value is passed through as a character with "chr", creating a one-character invalid ASCII string. Then, when the bad string is later converted to Unicode as the output is assembled, the UnicodeDecodeError exception is raised. So the fix is to change 255 to 127 in convert_charref in sgmllib.py, as shown below. This forces characters above 127 to be expressed with escape sequences. Please patch accordingly. Thanks. def convert_charref(self, name): """Convert character reference, may be overridden.""" try: n = int(name) except ValueError: return if not 0 <= n <= 127 : # ASCII ends at 127, not 255 return return self.convert_codepoint(n) ---------------------------------------------------------------------- Comment By: wrstl prmpft (wrstlprmpft) Date: 2007-02-05 07:16 Message: Logged In: YES user_id=801589 Originator: NO I had a similar problem recently and did not have time to file a bug-report. Thanks for doing that. The problem is the code that handles entity and character references in SGMLParser.parse_starttag. Seems that it is not careful about unicode/str issues. (But maybe Beautifulsoup needs to tell it to?) My quick'n'dirty workaround was to remove the offending char-entity from the website before feeding it to Beautifulsoup:: text = text.replace('®', '') # remove rights reserved sign entity cheers, stefan ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com