+1 from me!

Mark

On 25/10/2011 9:57 AM, Victor Stinner wrote:
Hi,

I propose to raise Unicode errors if a filename cannot be decoded on Windows,
instead of creating bogus filenames with question marks. This change is
incompatible with Python 3.2, even though such filenames are unusable and I
consider the problem a (Python?) bug, so I would like your opinion on it
before I start working on a patch.

--

Windows has worked internally on Unicode strings since Windows 95 (or
something like that), but also provides an "ANSI" API using the ANSI code page
and byte strings for backward compatibility. It was already proposed to drop
the bytes API completely from our nt (os) module, but that may break Python
backward compatibility (and it is difficult to list the Python programs that
use the bytes API to access the file system).

The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte
(encode) functions in the default mode (flags=0): MultiByteToWideChar()
replaces undecodable bytes with '?' and WideCharToMultiByte() ignores
unencodable characters (!!!). This behaviour produces invalid filenames (see
for example issue #13247) and *the user is unable to detect codec errors*.

In Python 3.2, I changed the MBCS codec to make it strict: it raises a
UnicodeEncodeError if a character cannot be encoded to the ANSI code page
(e.g. encode Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot
be decoded from the ANSI code page (e.g. b'\xff' from cp932).
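
To illustrate the difference between strict and lossy handling, here is a
minimal sketch. It assumes Python's portable "cp1252" codec as a stand-in for
the ANSI code page, because the "mbcs" codec only exists on Windows; the
helper names are hypothetical and are not part of any patch.

```python
# Stand-in for the ANSI code page (assumption): the real "mbcs" codec is
# only available on Windows, so the portable cp1252 codec is used here.
CODE_PAGE = "cp1252"


def encode_filename(name, errors="strict"):
    """Encode a text filename to the stand-in code page."""
    return name.encode(CODE_PAGE, errors)


def decode_filename(raw, errors="strict"):
    """Decode a bytes filename from the stand-in code page."""
    return raw.decode(CODE_PAGE, errors)


def is_encodable(name):
    """Return True if the filename survives a strict encode."""
    try:
        encode_filename(name)
    except UnicodeEncodeError:
        return False
    return True
```

With the "replace" or "ignore" handlers the encoded name no longer matches the
file on disk, and the caller gets no hint that anything was lost. In strict
mode the same input raises an exception that the caller can catch and report.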

I propose to reuse our MBCS codec in strict mode (error handler="strict")
together with the native Windows (wide character) API, so that encode/decode
errors are detected directly. It should simplify the source code: replace 2
versions of a function by 1 version + optional code to decode arguments and/or
encode the result.
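
The "1 version + optional decode/encode" idea can be sketched in pure Python
as follows. This is only an illustration under stated assumptions: the real
change would live in the C code of the nt module, ANSI_CODE_PAGE is a
hypothetical constant standing in for the "mbcs" codec, and
listdir_bytes_strict is a made-up name.

```python
import os

# Assumption: cp1252 stands in for the ANSI code page ("mbcs" on Windows).
ANSI_CODE_PAGE = "cp1252"


def listdir_bytes_strict(path):
    """Bytes flavour of os.listdir() built on top of the text (wide) version.

    The bytes argument is decoded strictly, the single text implementation
    does the work, and each result is encoded strictly. Any name that does
    not fit the code page raises instead of silently turning into '?'.
    """
    text_path = path.decode(ANSI_CODE_PAGE, "strict")
    names = os.listdir(text_path)
    return [name.encode(ANSI_CODE_PAGE, "strict") for name in names]
```

A caller that passes bytes either gets names that round-trip to real files or
gets a UnicodeDecodeError/UnicodeEncodeError pointing at the offending name.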

--

Read also the previous thread:

[Python-Dev] Byte filenames in the posix module on Windows
Wed Jun 8 00:23:20 CEST 2011
http://mail.python.org/pipermail/python-dev/2011-June/111831.html

--

FYI, I patched Python's MBCS codec again: it now handles the ignore and
replace modes correctly (for both encoding and decoding), and it now also
supports any error handler.

--

We might use PEP 383 to store undecodable bytes as surrogates
(U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX:
on Windows, the problem is more on encoding (text->bytes) than on decoding
(bytes->text). On UNIX, problems occur when the system is misconfigured (e.g.
wrong locale encoding). On Windows, problems occur when your application uses
the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you
created Unicode filenames with a program using the new (Windows) API.
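
A short sketch of why PEP 383 helps with decoding but not with encoding. It
uses the "surrogateescape" error handler that PEP 383 defines; cp1252 is again
only an assumed stand-in for the ANSI code page, and the helper name is
hypothetical.

```python
# Decoding direction (the UNIX case): an undecodable byte is smuggled
# through as a lone surrogate and can be restored exactly.
raw = b"caf\xff.txt"
text = raw.decode("utf-8", "surrogateescape")
roundtrip = text.encode("utf-8", "surrogateescape")


def encodes_with_surrogateescape(name, encoding="cp1252"):
    """Return True if name can be encoded using the surrogateescape handler.

    The handler only knows how to turn U+DC80-U+DCFF back into bytes, so a
    real character missing from the code page still raises.
    """
    try:
        name.encode(encoding, "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True
```

Here text contains U+DCFF in place of the bad byte and roundtrip equals raw,
while a filename containing Ł still cannot be encoded to cp1252. That is the
Windows case described above: the filename is valid Unicode, it just has no
representation in the code page.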

Only a few programs are fully Unicode compliant. A lot of programs fail if a
filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial
and Visual Studio).

Victor
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/skippy.hammond%40gmail.com
