STINNER Victor <victor.stin...@haypocalc.com> added the comment:

asvetlov> I'm skeptical about surrogates particularly for that 
asvetlov> problem. From my perspective the solution is only to use 
asvetlov> native unicode support for windows file operation functions.

The two approaches are not mutually exclusive. We can use surrogates on POSIX 
and convert to bytes at the system calls, and use the unicode version of the 
Windows API. In both cases, filenames are unicode.
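
To illustrate the POSIX half, here is a minimal sketch of the surrogate 
round trip (the "surrogateescape" error handler from PEP 383). The filename 
and the UTF-8 filesystem encoding are assumptions chosen for the example:

```python
# Sketch: round-trip an undecodable POSIX filename through str.
# Assumption: the filesystem encoding is UTF-8; the filename is made up.
raw_name = b"caf\xe9.py"  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")

print(ascii(as_text))
print(round_tripped == raw_name)
```

The filename stays a str everywhere except at the boundary, and no 
information is lost.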

asvetlov> Conversions utf-8 -> mbcs -> utf8 will lose encoding
asvetlov> information thanks to tricky Microsoft mbcs encoding schema.
asvetlov> If I'm wrong please correct me.

On Windows, Python3 *does* convert unicode to bytes with the mbcs encoding in 
the import machinery. I tested, and Python3 on Windows has the same problem 
with undecodable filenames as Python3 on Unix. E.g. add the "\u0809" character 
(an arbitrary non-encodable character) to the Python directory name: Python3 
doesn't start if the code page cannot encode/decode it.
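
A rough way to see the encoding failure without a Windows machine. The mbcs 
codec only exists on Windows, so this sketch assumes cp1252 as a stand-in for 
the ANSI code page; the directory name is hypothetical:

```python
# Sketch: check whether a directory name survives the code page.
# Assumption: cp1252 stands in for the Windows ANSI code page ("mbcs").
dir_name = "Python\u0809"  # hypothetical directory name

try:
    dir_name.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    # U+0809 has no mapping in cp1252, so strict encoding fails.
    encodable = False

print(encodable)
```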

To fix this on all OSes (Windows and POSIX), the Python3 import machinery 
should not convert filenames to bytes but manipulate unicode strings, and only 
convert filenames to bytes on POSIX at the last moment (at the system calls).
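
A minimal sketch of that idea, not the actual import machinery: all path 
manipulation happens on str, and bytes appear only in the final step. The 
helper names candidate_path and to_syscall_bytes are hypothetical, and UTF-8 
is an assumed filesystem encoding:

```python
import posixpath

def candidate_path(directory: str, module: str) -> str:
    """Build a module path using str only (hypothetical helper)."""
    return posixpath.join(directory, module + ".py")

def to_syscall_bytes(path: str, encoding: str = "utf-8") -> bytes:
    """Convert to bytes at the last moment, just before a POSIX system call."""
    return path.encode(encoding, "surrogateescape")

# A directory whose name contains the undecodable byte 0xFF.
directory = b"/opt/py\xff".decode("utf-8", "surrogateescape")

path = candidate_path(directory, "os")   # still a str
syscall_arg = to_syscall_bytes(path)     # bytes only at the boundary

print(syscall_arg)
```

On Windows the last step would be skipped and the str passed to the wide 
character API directly.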

--

The mbcs codec ignores the error handler: it replaces unknown characters with 
"?" by default, see #850997.
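
The effect of that "?" substitution can be approximated on any platform. This 
sketch assumes cp1252 with an explicit "replace" handler as a stand-in for 
what mbcs does implicitly; the name is hypothetical:

```python
# Sketch: lossy round trip when unknown characters become "?".
# Assumption: cp1252 + "replace" mimics the mbcs behaviour described above.
original = "Python\u0809"

lossy = original.encode("cp1252", "replace")  # U+0809 is replaced by "?"
restored = lossy.decode("cp1252")

print(lossy)
print(restored == original)
```

The restored name no longer matches the original, so a path built from it 
points at a directory that does not exist.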

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8611>
_______________________________________
_______________________________________________
Python-bugs-list mailing list