Bugs item #1599325, was opened at 2006-11-19 20:40
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the 
codepoint is <= 0xff.  This should be <= 0x7f.

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'.  But this 
holds only in the Latin-1 encoding.  For example, in UTF-8, the same 
character (u'\xa0') would be encoded as '\xc2\xa0'.  While I think it is 
reasonable for entitydefs to use the ASCII codec for characters encodable in 
that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 
encoding.
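
The encoding difference described above can be checked directly. This is
only a small sketch: the variable names are illustrative, and nothing here
comes from htmlentitydefs itself.

```python
# U+00A0 (NO-BREAK SPACE) as a unicode string.
nbsp = u'\xa0'

# The byte representation depends on the codec chosen.
latin1_bytes = nbsp.encode('latin-1')   # a single byte, 0xa0
utf8_bytes = nbsp.encode('utf-8')       # two bytes, 0xc2 0xa0

# ASCII cannot represent the character at all.
try:
    nbsp.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

So a consumer that receives the bare byte '\xa0' cannot interpret it
without knowing that Latin-1 was the codec used to produce it.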

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls 
handle_data.  The passed data can be '\xa0', which handle_data is forced to 
assume is Latin-1, when the source string might be encoded otherwise.

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2006-11-19 20:59

Message:

This is not a bug. The documentation specifies that entitydefs contains
Latin-1 byte strings, and many applications rely on that.

If you have different processing needs, you may want to use
htmlentitydefs.name2codepoint instead, or derive yet another table
automatically from it.
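
One possible way to derive such a table from name2codepoint is sketched
below. The helper build_entity_table and its fallback to a numeric
character reference are assumptions made for illustration, not part of the
library. The import is written to work on both Python 2 (htmlentitydefs)
and Python 3, where the module is named html.entities.

```python
try:
    from html.entities import name2codepoint   # Python 3 module name
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2 module name

try:
    unichr          # Python 2 built-in
except NameError:
    unichr = chr    # Python 3: chr already returns a unicode character


def build_entity_table(encoding):
    """Hypothetical helper: map entity names to bytes in `encoding`.

    Characters the codec cannot represent fall back to an ASCII numeric
    character reference such as '&#8364;'.
    """
    table = {}
    for name, codepoint in name2codepoint.items():
        try:
            table[name] = unichr(codepoint).encode(encoding)
        except UnicodeEncodeError:
            table[name] = ('&#%d;' % codepoint).encode('ascii')
    return table


utf8_entities = build_entity_table('utf-8')
latin1_entities = build_entity_table('latin-1')
```

With this sketch, utf8_entities['nbsp'] is the two-byte UTF-8 sequence,
while latin1_entities['nbsp'] is the single byte '\xa0'.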

----------------------------------------------------------------------
