[issue10828] Cannot use nonascii utf8 in names of files imported from

STINNER Victor Sat, 08 Jan 2011 18:44:12 -0800

STINNER Victor <victor.stin...@haypocalc.com> added the comment:

> ANSI code page: cp1252 ...os.fsencode('ä') => b'\xe4'


Hum, I ran your example with a debugger, and ok, I now remember the whole thing.

I fixed Python to support non-ASCII characters (... only non-ASCII characters 
encodable to the ANSI code page for Windows) in the *search path*, not in the 
module name.

The import machinery encodes each search path to the filesystem encoding, but 
it encodes the module name to UTF-8. Concatenating two byte strings encoded 
with different encodings doesn't work (it leads to mojibake).
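
To illustrate the mojibake, here is a minimal sketch. It assumes cp1252 as the ANSI code page (as in the quoted example); the search path "C:\modules" and the helper join_mixed() are hypothetical and only mimic the mixed-encoding concatenation, they are not the actual import code.

```python
# Assumption: cp1252 stands in for the Windows ANSI code page.
ANSI_CODE_PAGE = "cp1252"


def join_mixed(search_path, module_name):
    """Hypothetical helper: join a path and a module name encoded differently."""
    path_bytes = search_path.encode(ANSI_CODE_PAGE)  # filesystem encoding
    name_bytes = module_name.encode("utf-8")         # module name as UTF-8
    return path_bytes + b"\\" + name_bytes + b".py"


raw = join_mixed("C:\\modules", "ä")

# The bytes API interprets the whole byte string with the ANSI code page,
# so the two UTF-8 bytes of 'ä' are read as two separate characters.
seen_by_filesystem = raw.decode(ANSI_CODE_PAGE)
print(seen_by_filesystem)
```

The file name the filesystem ends up looking for is no longer "ä.py", so the import fails even though the file exists.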

To fix this problem, there are two solutions:

 a) encode the module name to the filesystem encoding
 b) manipulate paths as unicode strings; to access the filesystem: use the wide 
character (unicode) API of Windows and encode paths to the filesystem encoding 
on UNIX/BSD

It is easier to implement (a) than (b), but (a) only supports paths and module 
names encodable to the ANSI code page.

(b) gives you full unicode support because it never *encodes* paths to the 
filesystem encoding, though it may *decode* paths from the filesystem encoding. 
Encoding a path raises a UnicodeEncodeError on the first character not encodable 
to the ANSI code page, whereas decoding a path never fails (except if the user 
manually changed their code page to a rare ANSI code page like UTF-8).
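
The limitation of (a) can be sketched as follows, again assuming cp1252 as the ANSI code page. The function encode_module_name() is a hypothetical stand-in for what solution (a) would do, not code from the import machinery.

```python
# Assumption: cp1252 stands in for the Windows ANSI code page.
ANSI_CODE_PAGE = "cp1252"


def encode_module_name(name, encoding=ANSI_CODE_PAGE):
    """Hypothetical stand-in for solution (a): encode the name for the bytes API."""
    return name.encode(encoding)


# 'ä' exists in cp1252, so solution (a) works for it.
encoded_ok = encode_module_name("ä")

# These characters are not in cp1252, so encoding raises UnicodeEncodeError.
try:
    encode_module_name("日本語")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print("cannot encode:", exc.object[exc.start:exc.end])
```

With (b), the name stays a unicode string all the way to the wide character API, so this encoding step (and its failure) never happens.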

I implemented (b) in my import_unicode SVN branch, but as I wrote, I still have 
some work to merge this branch into py3k, and anyway I will wait for Python 3.3.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10828>
_______________________________________