New submission from Kang-Hao (Kenny) Lu <kennyl...@csail.mit.edu>:

Since Python 3.2.2 (I don't have an earlier version to test with),

>>> "\udc80".encode("utf-8")
UnicodeEncodeError: *utf-8* codec can't encode character '\udc80'...

but

>>> b"\xff".decode("utf-8")
UnicodeDecodeError: *utf8* codec can't decode byte 0xff in position 0

and the table in the documentation of the codecs module suggests *utf_8* as the 
name of the codec, which I believe to be equivalent to "utf-8" because '-' is 
not a valid character in an identifier.
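
For anyone who wants to check this on their own build, here is a minimal sketch. The helper name reported_encoding is mine, not part of any API; the only thing it relies on is the "encoding" attribute of the two exception types. What it prints will depend on the Python version, so I am not quoting any output.

```python
import codecs


def reported_encoding(func):
    """Call func and return the codec name stored on the error it raises."""
    try:
        func()
    except (UnicodeEncodeError, UnicodeDecodeError) as exc:
        return exc.encoding
    return None


# The two failing calls from the report above.
encode_name = reported_encoding(lambda: "\udc80".encode("utf-8"))
decode_name = reported_encoding(lambda: b"\xff".decode("utf-8"))
print("encode error reports:", encode_name)
print("decode error reports:", decode_name)

# The three spellings should all resolve to one registered codec.
canonical = {codecs.lookup(name).name for name in ("utf-8", "utf8", "utf_8")}
print("canonical name(s):", canonical)
```

If the two reported names differ while the set at the end has a single element, that is the display inconsistency described here, not a difference in the codec itself.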

Can we at least make the above two consistent? I would go for "utf-8", which 
was probably introduced for rejecting surrogates, but "utf8" has been there for 
years. What do we do? I am happy to submit patches for all branches. These are 
one-liners anyway.

The backward compatibility risk should be pretty low, since code usually 
doesn't read the encoding from these errors and I don't see any use of 
PyUnicode(Encode|Decode)Error_GetEncoding in trunk, although I'm using it for 
issue #12892.

Also, "latin_1" displays as *latin-1* but "iso2022-jp" displays as 
*iso2022_jp*. I care less about this nit though.
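
The same kind of check works for these two codecs. This is only a sketch: the sample character U+0E01 is my choice, picked on the assumption that neither codec can encode it, and the displayed names may differ between versions.

```python
# A Thai character (assumed to be unencodable in both codecs).
sample = "\u0e01"

displayed = {}
for requested in ("latin_1", "iso2022_jp"):
    try:
        sample.encode(requested)
    except UnicodeEncodeError as exc:
        # exc.encoding is the name shown in the error message.
        displayed[requested] = exc.encoding

for requested, shown in displayed.items():
    print(requested, "displays as", shown)
```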

----------
components: Unicode
messages: 152399
nosy: ezio.melotti, kennyluck
priority: normal
severity: normal
status: open
title: utf-8 or utf8 or utf_8 (codec display name inconsistency)
versions: Python 2.7, Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13913>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
