Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Toshio Kuratomi Tue, 28 Apr 2009 19:15:18 -0700

Zooko O'Whielacronx wrote:
> On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
>> If you switch to iso8859-15 only in the presence of undecodable UTF-8,
>> then you have the same round-trip problem as the PEP: both b'\xff' and
>> b'\xc3\xbf' will be converted to u'\u00ff' without a way to
>> unambiguously recover the original file name.
> 
> Why do you say that?  It seems to work as I expected here:
> 
>>>> '\xff'.decode('iso-8859-15')
> u'\xff'
>>>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'
>>>>
>>>>
>>>>
>>>> '\xff'.decode('cp1252')
> u'\xff'
>>>> '\xc3\xbf'.decode('cp1252')
> u'\xc3\xbf'
>


Your example doesn't show the fallback path.  What won't work is first
trying a local encoding (in the following example, utf-8) and then, if
that fails, falling back to a one-byte encoding like iso8859-15:

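# b'\xff' is not valid utf-8, so this takes the fallback path
try:
    file1 = '\xff'.decode('utf-8')
except UnicodeDecodeError:
    file1 = '\xff'.decode('iso-8859-15')
print repr(file1)

# b'\xc3\xbf' is valid utf-8, so the fallback is never used
try:
    file2 = '\xc3\xbf'.decode('utf-8')
except UnicodeDecodeError:
    file2 = '\xc3\xbf'.decode('iso-8859-15')
print repr(file2)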


That prints:
  u'\xff'
  u'\xff'

The two encodings can map different bytes to the same Unicode code point,
so you can't do this type of thing without recording which encoding was
used in the translation.
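
To illustrate what "recording the encoding" could look like, here is a
minimal sketch.  It is written for Python 3 (bytes literals, no u''
prefix), and decode_with_fallback and encode_back are hypothetical
helper names made up for this example, not anything proposed in the PEP:

```python
# Hypothetical helpers for illustration only (Python 3 syntax).

def decode_with_fallback(raw, primary="utf-8", fallback="iso-8859-15"):
    """Decode raw bytes, returning (text, encoding_used)."""
    try:
        return raw.decode(primary), primary
    except UnicodeDecodeError:
        # iso-8859-15 maps every byte value, so this cannot fail
        return raw.decode(fallback), fallback


def encode_back(text, encoding):
    """Re-encode text with the encoding recorded at decode time."""
    return text.encode(encoding)


name1 = decode_with_fallback(b"\xff")      # takes the fallback path
name2 = decode_with_fallback(b"\xc3\xbf")  # valid utf-8, no fallback

# The decoded text alone is ambiguous: both names are u'\xff' ...
print(name1[0] == name2[0])
# ... but the recorded encoding recovers the original bytes.
print(encode_back(*name1), encode_back(*name2))
```

Without the second element of each pair there is no way to tell which
byte string the text came from.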

-Toshio
