Bugs item #1599325, was opened at 2006-11-19 14:40
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the codepoint is <= 0xff. This should be <= 0x7f.

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'. But this is only "true" in the Latin-1 encoding. For example, in UTF8, the same character (u'\xa0') would be encoded '\xc2\xa0'.

While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding.

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data. The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, when the source string might be encoded otherwise.
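
The encoding mismatch described in the report can be reproduced with a minimal sketch. It is written for Python 3, where the htmlentitydefs module is named html.entities, so it is not the code from the report itself. The helper names entity_bytes and ascii_safe_entitydef are hypothetical. The second helper only illustrates the proposed <= 0x7f threshold, and assumes that entries above the threshold fall back to a numeric character reference.

```python
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs


def entity_bytes(name, encoding):
    """Return the bytes for an HTML entity's character in the given encoding."""
    return chr(name2codepoint[name]).encode(encoding)


def ascii_safe_entitydef(name):
    """Hypothetical variant using the threshold proposed in the report.

    Only ASCII codepoints (<= 0x7f) become literal characters; anything
    higher is returned as a numeric character reference (an assumption
    about the fallback, made here for illustration).
    """
    codepoint = name2codepoint[name]
    if codepoint <= 0x7f:
        return chr(codepoint)
    return "&#%d;" % codepoint


# The same character has different byte sequences in different encodings.
print(entity_bytes("nbsp", "latin-1"))  # b'\xa0'
print(entity_bytes("nbsp", "utf-8"))    # b'\xc2\xa0'

# With the <= 0x7f threshold, no encoding is assumed for non-ASCII entities.
print(ascii_safe_entitydef("amp"))      # &
print(ascii_safe_entitydef("nbsp"))     # &#160;
```

The single byte '\xa0' is the Latin-1 form only, so a consumer such as handle_data cannot tell from that byte alone which encoding the surrounding document uses.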