2016-02-10 11:18 GMT+01:00 Steven D'Aprano <st...@pearwood.info>:
> [steve@ando ~]$ python3.3 -c 'print(open(b"/tmp/abc\xD8\x01", "r").read())'
> Hello World
>
> [steve@ando ~]$ python3.3 -c 'print(open("/tmp/abc\xD8\x01", "r").read())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/abcØ\x01'
>
> What Unicode string does one need to give in order to open file
> b"/tmp/abc\xD8\x01"?

Use os.fsdecode(b"/tmp/abc\xD8\x01") to get the filename as a Unicode
string; it will work.
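
A minimal sketch of why this works, assuming a POSIX system where
os.fsdecode() uses the "surrogateescape" error handler: the undecodable
byte is mapped to a lone surrogate, and os.fsencode() maps it back, so
the str name refers to the same file. The explicit UTF-8 decode at the
end is only an illustration of what happens under a UTF-8 locale.

```python
import os

raw = b"/tmp/abc\xD8\x01"  # file name as raw bytes

# On POSIX, undecodable bytes become lone surrogates (surrogateescape).
name = os.fsdecode(raw)
print(ascii(name))

# The conversion is reversible, so open(name) reaches the same file.
roundtrip = os.fsencode(name)
print(roundtrip == raw)

# Assuming a UTF-8 filesystem encoding, fsdecode is equivalent to:
explicit = raw.decode("utf-8", "surrogateescape")
print(ascii(explicit))
```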

Removing the 'b' in front of a byte string is not enough to convert an
arbitrary byte string to Unicode :-D Encodings are more complex than
that... See http://unicodebook.readthedocs.org/

The problem on Python 2 is that the UTF-8 encoders encode surrogate
characters, which is wrong. You cannot use an error handler to choose
how to handle these surrogate characters.
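
For contrast, a small sketch of the Python 3 behaviour (not the Python 2
one described above): the strict UTF-8 encoder refuses lone surrogates,
and the error handler decides what happens instead. The explicit
"utf-8" codec is an assumption chosen so the result does not depend on
the locale.

```python
# 'abc\udcff': the byte 0xff is kept as a lone surrogate.
name = b"abc\xff".decode("utf-8", "surrogateescape")

try:
    name.encode("utf-8")  # strict is the default error handler
except UnicodeEncodeError as exc:
    print("strict encoder refuses:", exc.reason)

# surrogateescape restores the original byte.
escaped = name.encode("utf-8", "surrogateescape")
print(escaped)

# surrogatepass encodes the surrogate code point itself as three bytes.
passed = name.encode("utf-8", "surrogatepass")
print(passed)
```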

On Python 3, you have a wide choice of builtin error handlers, and you
can even write your own error handlers. Example with Python 3.6, using
the "namereplace" error handler (added in Python 3.5):

>>> import os
>>> def format_filename(filename, encoding='ascii', errors='backslashreplace'):
...     return filename.encode(encoding, errors).decode(encoding)
...

>>> print(format_filename(os.fsdecode(b'abc\xff')))
abc\udcff

>>> print(format_filename(os.fsdecode(b'abc\xff'), errors='replace'))
abc?

>>> print(format_filename(os.fsdecode(b'abc\xff'), errors='ignore'))
abc

>>> print(format_filename(os.fsdecode(b'abc\xff') + "é", errors='namereplace'))
abc\udcff\N{LATIN SMALL LETTER E WITH ACUTE}

My locale encoding is UTF-8.
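
As for writing your own error handler, here is a rough sketch using
codecs.register_error(). The handler name "hexescape" and the <U+XXXX>
output format are made up for this illustration; the file name is
decoded explicitly as UTF-8 so the result does not depend on the locale.

```python
import codecs

def hex_escape(exc):
    """Hypothetical handler: show each unencodable character as <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end

codecs.register_error("hexescape", hex_escape)

name = b"abc\xff".decode("utf-8", "surrogateescape")
shown = name.encode("ascii", "hexescape").decode("ascii")
print(shown)
```

The same handler name can then be passed as the errors argument of
format_filename().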

Victor