Bugs item #1599325, was opened at 2006-11-19 20:40
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: None
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the
codepoint is <= 0xff. This should be <= 0x7f.

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'. But
this is only "true" in the Latin-1 encoding. For example, in UTF8, the same
character (u'\xa0') would be encoded '\xc2\xa0'. While I think it is
reasonable for entitydefs to use the ASCII codec for characters encodable in
that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1
encoding.

This issue affects sgmllib.SGMLParser, for example, when handle_entityref
calls handle_data. The passed data can be '\xa0', which handle_data is
forced to assume is Latin-1, when the source string might be encoded
otherwise.

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2006-11-19 20:59

Message:
Logged In: YES
user_id=21627
Originator: NO

This is not a bug. entitydefs is specified to contain Latin-1 byte strings
in its documentation, and many applications rely on that.

If you have different processing needs, you may want to use
htmlentitydefs.name2codepoint instead, or derive yet another table
automatically from it.

----------------------------------------------------------------------
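
The suggestion in the comment above (deriving another table from name2codepoint) might look roughly like the following. This is only a sketch under some assumptions: it is written for Python 3, where the htmlentitydefs module is named html.entities, and build_entity_table is a hypothetical helper name, not something in the standard library. The fallback to a numeric character reference for characters the target encoding cannot represent is likewise an illustrative choice.

```python
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs


def build_entity_table(encoding="utf-8"):
    """Hypothetical helper: map entity names to byte strings in `encoding`.

    Characters that the target encoding cannot represent fall back to an
    ASCII numeric character reference such as b"&#945;".
    """
    table = {}
    for name, codepoint in name2codepoint.items():
        try:
            table[name] = chr(codepoint).encode(encoding)
        except UnicodeEncodeError:
            table[name] = ("&#%d;" % codepoint).encode("ascii")
    return table


# The same entity yields different bytes depending on the chosen encoding.
utf8_entities = build_entity_table("utf-8")
latin1_entities = build_entity_table("latin-1")
```

With these tables, utf8_entities["nbsp"] holds the two-byte UTF-8 sequence from the initial comment, while latin1_entities["nbsp"] holds the single Latin-1 byte, so a parser can pick whichever table matches the encoding of its input.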