On 15Aug2016 1819, eryk sun wrote:
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.do...@python.org> wrote:
(Frankly I don't mind what encoding we use, and I'd be quite happy to force
bytes paths to be UTF-16-LE encoded, which would also round-trip invalid
surrogate pairs. But that would prevent basic manipulation which seems to be
a higher priority.)
The CRT manually decodes and encodes using the private functions
__acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use
either the ANSI or OEM codepage, depending on the value returned by
WinAPI AreFileApisANSI. CPython could follow suit. Doing its own
encoding and decoding would enable using filesystem functions that
will never get an [A]NSI version (e.g. GetFileInformationByHandleEx),
while still retaining backward compatibility.
Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning
when lpUsedDefaultChar is true. Filesystem decoding could use
MB_ERR_INVALID_CHARS and raise a warning and retry without this flag
for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This
could be implemented with a new "warning" handler for
PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new
'fsmbcs' encoding could be added that checks AreFileApisANSI to choose
between CP_ACP and CP_OEMCP.
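
For illustration only, the "strict first, then warn and retry" behaviour described above can be approximated in pure Python. This is a rough sketch, not the proposed C-level handler: it uses Python's own cp1252 codec as a stand-in for CP_ACP instead of calling WideCharToMultiByte or MultiByteToWideChar, and the function names are hypothetical.

```python
import warnings


def encode_path_with_warning(path, codepage="cp1252"):
    """Encode strictly; if that fails, warn and retry with replacement."""
    try:
        return path.encode(codepage, errors="strict")
    except UnicodeEncodeError:
        # Roughly analogous to lpUsedDefaultChar being set: data was lost.
        warnings.warn(
            f"{path!r} is not representable in {codepage}",
            UnicodeWarning,
            stacklevel=2,
        )
        return path.encode(codepage, errors="replace")


def decode_path_with_warning(data, codepage="cp1252"):
    """Decode strictly; if that fails, warn and retry with replacement."""
    try:
        return data.decode(codepage, errors="strict")
    except UnicodeDecodeError:
        # Roughly analogous to ERROR_NO_UNICODE_TRANSLATION, then a retry
        # without MB_ERR_INVALID_CHARS.
        warnings.warn(
            f"{data!r} is not valid {codepage}",
            UnicodeWarning,
            stacklevel=2,
        )
        return data.decode(codepage, errors="replace")
```

Note that the warning here depends on the value of the path, not its type.
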
None of that makes it less complicated or more reliable. Warnings based
on values are bad (they should be based on types) and using the *W APIs
exclusively is the right way to go. The question then is whether we
allow file system functions to return bytes, and if so, which encoding
to use. This then directly informs what the functions accept, for the
purposes of round-tripping.
*Any* encoding that may silently lose data is a problem, which basically
leaves utf-16 as the only option. However, as that causes other
problems, maybe we can accept the tradeoff of returning utf-8 and
failing when a path contains invalid surrogate pairs (which is extremely
rare by comparison to characters outside of CP_ACP)?
If utf-8 is unacceptable, we're back to the current situation and should
be removing the support for bytes that was deprecated three versions ago.
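
To make the trade-off concrete, here is a small sketch that runs on any platform. It uses Python's built-in codecs, with cp1252 standing in for CP_ACP, and the helper name is made up for the example.

```python
def survives_round_trip(path, encoding, errors="strict"):
    """Return True if encoding then decoding gives back the same str."""
    try:
        return path.encode(encoding, errors).decode(encoding, errors) == path
    except UnicodeError:
        # The codec refused the data: a loud failure, not silent loss.
        return False


lone_surrogate = "bad\ud800name.txt"   # lone high surrogate in the name
non_acp = "\u65e5\u672c.txt"           # characters outside cp1252

# UTF-16-LE with surrogatepass round-trips even the invalid name.
print(survives_round_trip(lone_surrogate, "utf-16-le", "surrogatepass"))

# Strict UTF-8 refuses the lone surrogate, but handles everything else.
print(survives_round_trip(lone_surrogate, "utf-8"))
print(survives_round_trip(non_acp, "utf-8"))

# A code page with replacement "succeeds" but silently loses data.
print(survives_round_trip(non_acp, "cp1252", "replace"))
```

Only the code page case fails quietly; strict UTF-8 raises an error on the rare invalid path instead of returning a different name.
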
Cheers,
Steve